In this blog, you'll learn about MLOps from the perspective of a software engineer, building a solid mental model of MLOps on top of your existing knowledge of software engineering and DevOps.
Who is this blog for?
Say you’re a software developer, a DevOps engineer, or an SRE. Unless you've been living under a rock, there's a high chance you've been bombarded with jargon: LLMs (Large Language Models), inference, ML, ML pipelines, RAG, and so on. In this blog, we want to provide enough context to help you build a mental model from your existing knowledge of software engineering.
Familiarity with DevOps practices is all that we expect from you!
Before we start, a spoiler
MLOps is just your usual DevOps with an added Data Engineering and ML flair.
Yes, that’s it!
But it's not that simple…
Figure: A Simplified MLOps Workflow
You’ve seen this infinity wheel before… Where does this extra loop come from? Let's explain it in a jiffy.
The Challenge with ML Systems
Traditionally, software systems have well-defined inputs, processing rules (encompassing programming and application logic), and outputs. These systems are built from distinct components such as User Interface, Business Logic, Database, and Infrastructure. Their strength lies in their explicit programming, which enables them to perform specific tasks in a predictable manner.
In contrast, machine learning (ML) systems are fundamentally different in nature. The components of an ML system include Data Acquisition, Data Preparation, Model Training, Model Evaluation, Model Deployment, Model Monitoring, and Infrastructure.
Figure: Software Systems v/s ML Systems
Instead of relying on explicitly programmed rules, ML systems focus on learning patterns from data and utilizing this knowledge to make predictions or decisions. The power of ML systems stems from their ability to adapt and learn from data, enabling them to perform tasks in a dynamic and often more accurate manner than traditional software systems.
However, this very strength also presents a significant challenge: the performance and reliability of ML systems are dependent on the quality and representativeness of the data, the appropriateness of the model, and the correct implementation of the code. Any of these elements - data, model, or code - can betray the system, leading to unpredictable or undesirable outcomes.
Figure: Treat ML systems as model+data+code.
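To make the contrast concrete, here's a minimal sketch, assuming scikit-learn and a made-up spam example: in traditional software, the rule is written by hand; in an ML system, the "rule" is learned from data.

```python
# Traditional software: the rule is explicitly programmed.
def is_spam_rule_based(num_links: int, has_suspicious_words: bool) -> bool:
    return num_links > 10 or has_suspicious_words

# ML system: the "rule" is learned from labeled examples.
# (scikit-learn assumed; the tiny dataset is purely illustrative.)
from sklearn.linear_model import LogisticRegression

X = [[2, 0], [15, 1], [1, 0], [30, 1]]  # features: [num_links, has_suspicious_words]
y = [0, 1, 0, 1]                        # labels: 0 = not spam, 1 = spam
model = LogisticRegression().fit(X, y)

print(model.predict([[20, 1]]))  # behavior depends entirely on the data it saw
```

Change the data, and the learned "rule" changes with it. That is exactly why data becomes a first-class citizen in MLOps.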
Knowledge Silos are unavoidable
The DevOps movement aimed to bridge the knowledge gap between Development and Operations teams, and MLOps follows a similar philosophy, but with even larger silos to overcome.
Figure: The Convergence of Development, Machine Learning and Operations
Data Engineers specialize in curating datasets, acquiring data, and validating it, while ML Engineers focus on fine-tuning models, understanding data trends, and maintaining model accuracy.
Data Scientists and ML Researchers excel in developing ML algorithms and models.
Software Teams (Software Engineers, SREs, DevOps Practitioners, etc.) specialize in developing, delivering, scaling, and maintaining traditional software applications.
The MLOps philosophy aims to bridge these knowledge silos, allowing for seamless collaboration and communication between teams and domains, and ultimately leading to more efficient and effective development and deployment of ML systems.
A 10K Feet Overview of MLOps
The MLOps life cycle, much like the SDLC, is a cyclical process involving three primary stages:
Figure: MLOps Life Cycle
For a more detailed view of this section, refer to https://ml-ops.org/content/mlops-principles
Design: This phase focuses on defining the problem, identifying use cases, and ensuring data availability. It involves requirements engineering and use-case prioritization.
Model Development: This stage encompasses data engineering, ML model engineering, and model testing and validation. It is where the machine learning model is built, trained, and evaluated.
Operations: Once the model is developed, it moves to the operations phase, which includes ML model deployment, CI/CD pipelines, and monitoring and triggering mechanisms. This ensures the model is effectively deployed and its performance is tracked.
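To ground the operations stage, here's a minimal, hedged sketch of what "ML model deployment" can look like as a prediction service, using FastAPI (one common choice, not the only one; the model file name and feature shape are hypothetical):

```python
# A minimal prediction service: load a trained model, expose a /predict endpoint.
# "model.joblib" is a hypothetical artifact produced by the model development stage.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])[0]
    return {"prediction": float(prediction)}  # log/monitor these in production
```

Run it with uvicorn (e.g., `uvicorn service:app`, assuming the file is named service.py) and you have the deployment box from the figure; the monitoring and triggering mechanisms then watch what this endpoint sees and returns.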
MLOps Culture: DevOps' Cooler Cousin
Alright, folks, remember how DevOps culture revolved around CI/CD? MLOps is like DevOps bitten by a radioactive data scientist.
It brings more members into the Continuous Everything Club.
The Continuous Everything Club
Continuous Integration (CI) extends testing and validation beyond code and components to also cover data and models (see the sketch after this list).
Continuous Delivery (CD) is concerned with delivering an ML training pipeline that automatically deploys the ML model prediction service.
Continuous Training (CT) is a property unique to ML systems: it automatically retrains ML models for re-deployment.
Continuous Monitoring (CM) is concerned with monitoring production data and model performance metrics, which are tied to business metrics.
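Here's a minimal sketch of what "CI extended to data and models" can look like: pytest-style checks that gate the pipeline. The dataset and the 0.90 accuracy gate are made up for illustration; real pipelines would test schemas, distributions, and more.

```python
# CI checks for an ML pipeline: validate the data AND the model, not just the code.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def test_data_is_valid():
    # Data validation: fail the build if features contain NaNs.
    assert not np.isnan(X_train).any()

def test_model_meets_quality_bar():
    # Model validation: fail the build if accuracy drops below the gate.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= 0.90, f"accuracy {accuracy:.2f} fell below the 0.90 gate"
```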
MLOps Automation: Enhancing Machine Learning Workflows
In DevOps, we automate configuration management, infrastructure provisioning, and application deployment. All of this works great for code. But in ML, we also need to automate how our models are periodically retrained, how the data is cleaned, and so on.
Figure: MLOps Automation vs DevOps Automation
What to Automate? Why Automate? How to Automate?
Much like automation in a traditional software process, we strive for additional automation goals in MLOps processes.
The Key Challenges that MLOps Automation addresses are:
ML models are difficult to test, as their outputs keep changing over time.
Manually updating datasets and retraining becomes difficult as more and more data is incorporated into a system.
ML models are statistical, which means traditional testing methods such as unit tests, integration tests, and smoke tests, while necessary, are not sufficient. We need to incorporate additional testing methods into our CI pipelines.
As a model serves more and more queries, the data and the world it operates in keep changing, which means two kinds of challenges can arise (we sketch how to detect one of them in code after this section).
A. Concept Drift:
Concept Drift is when the underlying patterns in data change over time, making a previously trained ML model less accurate.
Concept drift is like your friend changing their personality.
For example, let's say your robot was designed to recommend outfits based on the weather. It learns that people wear warm jackets when it's cold. But then, a new fashion trend emerges where people start wearing light jackets even in the peak winter season. Your robot is confused because the relationship between weather and clothing choice has changed.
This is concept drift. The underlying rules or patterns (how weather affects clothing choice) have changed, not just the data itself.
B. Data Drift:
Data Drift occurs when the distribution of the input data changes over time.
Data drift is like your friend changing their wardrobe.
Let's imagine you build a fashion advisor robot. You teach it to recommend outfits based on what's popular and trending, using photos of people. Over time, fashion trends change. Suddenly, everyone's wearing neon colors and oversized sweaters, which your robot hasn't seen before. This means the data your robot learned from (old fashion photos) doesn't match what's popular now (new fashion trends).
This is data drift. The kind of data hasn't changed (it's still photos of people), but the content of the data has (different styles).
So, while data drift is about changes in the data you're seeing, concept drift is about changes in how that data relates to the thing you're trying to predict.
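To make this less abstract, here is a minimal sketch of one common way to detect data drift: compare the distribution a feature had at training time against what production is seeing now, using a two-sample Kolmogorov-Smirnov test (scipy assumed; the synthetic data and the 0.01 threshold are illustrative). Concept drift is harder to catch automatically and usually requires monitoring the model's actual performance against ground truth.

```python
# Data drift detection sketch: has this feature's distribution shifted?
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # what training saw
production_feature = rng.normal(loc=0.7, scale=1.0, size=5000)  # what production sees

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # distributions differ significantly: likely drift
    print(f"Data drift detected (KS statistic = {statistic:.3f}); time to retrain?")
```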
The 3-Level MLOps Automation Framework
The MLOps SIG (Special Interest Group) has suggested a 3-level maturity model to assess the maturity of automation in an ML system.
Manual Everything (Level 0): You do everything yourself; perfect for initial exploration. Every step in each pipeline, such as data preparation and validation, model training, and testing, is executed manually. The common way to work at this level is to use Rapid Application Development (RAD) tools, such as Jupyter Notebooks.
Automated Training (Level 1): Models train automatically on new data, with validation checks. We introduce CT (Continuous Training), i.e., we retrain the model whenever new data is available. We also include steps for data validation and model validation (see the retraining sketch after the figure below).
CI/CD Pipeline (Level 2): Everything (data, models, training) is automatically built, tested, and deployed.
Figure: ML pipeline with CI/CD
(Credits: https://ml-ops.org/content/mlops-principles)
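As a flavor of Level 1, here is a hedged sketch of a continuous training step: validate incoming data, retrain, and promote the candidate model only if it beats the current one on a holdout set. The file paths, the minimum row count, and the F1 metric are illustrative assumptions, not a prescribed design.

```python
# Level 1-style continuous training: retrain on new data, promote only if better.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def retrain_if_better(new_data_path: str, model_path: str) -> bool:
    df = pd.read_csv(new_data_path)

    # Data validation: refuse to train on obviously broken data.
    if df["label"].isna().any() or len(df) < 1000:
        raise ValueError("new data failed validation; skipping retraining")

    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    candidate = RandomForestClassifier().fit(X_train, y_train)
    current = joblib.load(model_path)

    # Model validation: promote only if the candidate beats the current model.
    new_score = f1_score(y_val, candidate.predict(X_val), average="macro")
    old_score = f1_score(y_val, current.predict(X_val), average="macro")
    if new_score > old_score:
        joblib.dump(candidate, model_path)
        return True
    return False
```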
Versioning and Experiments
In MLOps, we not only version our code but also the data and models. These changes are tracked and given as inputs to pipelines.
What are we versioning in MLOps?
Data: Because your model is only as good as the data you feed it.
Models: Different architectures, different results. You need to test your new model if there is a change in the way it is built and/or trained.
Experiments: Very much like real-world experiments, each pipeline run is used to build, validate, and test your models, data, and code. Hence the name.
Why maintain versioning?
Reproducibility, collaboration, and debugging. The basic premise of versioning remains the same as in DevOps, but here it covers not just your code, but also your data and models.
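Dedicated tools (DVC, lakeFS, model registries) handle this at scale, but the core idea fits in a few lines: pin exactly which data, code, and model went into a training run. This is only an illustrative sketch; the paths and manifest format are made up.

```python
# The core idea of data/model versioning: record exactly what went into a run.
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    # Content hash: two runs trained on byte-identical data get the same ID.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def record_training_run(data_path: str, code_commit: str, model_path: str) -> None:
    manifest = {
        "data_sha256": dataset_fingerprint(data_path),  # which data
        "code_commit": code_commit,                     # which code (git SHA)
        "model_path": model_path,                       # which model artifact
    }
    Path("training_manifest.json").write_text(json.dumps(manifest, indent=2))
```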
Experiment Tracking: Your ML Lab Notebook
Think of it as a journal of your work. We keep tabs on:
Hyperparameters: The knobs and dials of your model.
Metrics: How well (or terribly) your model is performing.
Artifacts: The stuff your experiment spits out, e.g., your final model.
Code: Because, duh.
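Tools like MLflow and Weights & Biases automate this journal for you. Here's a hedged sketch using MLflow (one popular option; the parameters, dataset, and run name are illustrative):

```python
# Experiment tracking sketch: log params, metrics, and the model artifact.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 5}  # the knobs and dials
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_metric("accuracy", accuracy)   # how well (or terribly) it did
    mlflow.sklearn.log_model(model, "model")  # the artifact it spits out
```

Every run now shows up in the MLflow UI, so comparing the experiment that crushed it against the one that face-planted becomes a diff, not an archaeology dig.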
Why do you need experiment tracking?
Comparison: Find out why one experiment crushed it while another face-planted.
Analysis: Spot trends and patterns across your experiments.
Reproducibility: (Yeah, it's important enough to mention twice.)
Bottom line: In ML, your code, data, and experiments are all joined at the hip. Keeping track of all this stuff isn't just nice-to-have, it's a must-have. Unless you enjoy banging your head against the wall trying to figure out why your model suddenly started predicting that cats are actually toasters.
Steps Ahead
So far, we have discussed the prerequisites for dealing with ML systems. Next, we are going to deep-dive into the tools and technologies to look out for when designing your LLM applications and building MLOps and LLMOps practices around them. Subscribe to our blog below!
If your organization needs help building and scaling your systems, be it MLOps or something else, don't forget to reach out to us here.