Typically, when we think about machine learning, we think along the lines of ‘grab a dataset from Kaggle, choose and train an appropriate model for the use case, tune the hyperparameters, evaluate on the hold-out set, report the 98% accuracy, and the job is done’. This is largely true for the research side of ML; however, once the model actually has to be used in some real application, the story becomes a bit more complicated.
In this topic, we will look at the intersection of software engineering and machine learning and learn about the aspects that should be kept in mind when deploying ML models.
From notebook to deployment
When one is running experiments in a Jupyter notebook, there are rarely any thoughts about versioning the code in the notebook, the hyperparameters, the data, or other artefacts. It makes sense: notebooks are meant for quick iteration.
Now, imagine getting a new task and remembering working on a very similar problem 6 months ago, coming up with a pretty good solution at the time, and all you have is the training notebook in a repository (the dataset was quite hefty at 5 GB, so it was listed in .gitignore and never committed).
You run the notebook, and something doesn’t look right. Just to name a few possible nuances: the package versions have been updated, some libraries now conflict, your data does not look the same as it did 6 months ago (some new features might have been added, some removed, some changed in other ways, and thus your past feature engineering effort goes down the drain), the hyperparameters are no longer optimal, and quite a few other issues appear further down the line.
This problem can be broadly described as ML reproducibility, and thus, when models are expected to run in a production setting, it’s important to understand what exactly one should keep track of, and how.
The ML & SE artefacts
Let’s take a quick look at what is typically tracked in more ‘traditional’ software, say, a web application. More or less, it’s the code base itself (whatever is needed to run the application), likely packaged in a Docker container, plus database migrations (which track schema changes). While some principles, such as version control and testing, remain relevant, the dynamic nature of ML models calls for an adaptation of the traditional software engineering approaches.
One significant difference lies in the versioning and reproducibility aspects. In traditional software engineering, versioning the code base is typically enough to ensure reproducibility. With ML models, versioning the model artifacts (e.g., trained model weights, preprocessing pipelines, and data used for training) is also required, because even slight changes in the training data or the preprocessing steps can lead to unexpected model behavior.
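To make this concrete, here is a minimal sketch (the file names, model, and dataset are hypothetical stand-ins) of saving a trained model together with a small metadata file that records a hash of the training data and the hyperparameters, so a later run can tell whether the exact same combination is still in place:

```python
import hashlib
import json
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data; in practice this would be your own dataset.
X, y = load_iris(return_X_y=True)
params = {"n_estimators": 100, "random_state": 42}  # hypothetical hyperparameters

model = RandomForestClassifier(**params).fit(X, y)

# Fingerprint the training data so a future run can detect that it has changed.
data_hash = hashlib.sha256(X.tobytes() + y.tobytes()).hexdigest()

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("model_meta.json", "w") as f:
    json.dump({"data_sha256": data_hash, "hyperparameters": params}, f, indent=2)
```

This is only a sketch of the idea; dedicated tools discussed below handle the same bookkeeping in a far more systematic way.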
Testing strategies also need to be adapted when dealing with ML models. While traditional software testing focuses on verifying functional requirements, testing ML models requires evaluating their performance on representative data sets and monitoring for potential drift or degradation over time. This involves integrating testing frameworks that can handle the probabilistic nature of ML model outputs and accurately measure model performance using appropriate evaluation metrics.
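As an illustration, a performance test for a model might look like the pytest-style sketch below, with a hypothetical accuracy threshold chosen from past experiments; unlike a classic unit test, it asserts a statistical property of the model rather than an exact output:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical threshold: picked from earlier experiments, not a universal constant.
MIN_ACCURACY = 0.90


def test_model_meets_accuracy_threshold():
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # Accept any score above the bar rather than expecting an exact value.
    assert accuracy >= MIN_ACCURACY
```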
Unlike traditional software systems, which typically exhibit deterministic behavior, ML models can be influenced by changes in input data distributions or environmental factors. The models often fail silently (e.g., no exceptions are thrown and predictions are generated as expected, but the issues stay hidden and sometimes become apparent only after a long period of time). Continuous monitoring of model performance, input data quality, and potential distribution shifts is necessary to detect model degradation. This may involve implementing automated monitoring pipelines, setting performance thresholds, and triggering alerts or retraining procedures when necessary.
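A very simple form of such monitoring is comparing the distribution of an incoming feature with the one seen during training. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy on synthetic data and a hypothetical alerting threshold; real pipelines would run similar checks per feature on a schedule:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic example: a feature as seen at training time vs. in production.
rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted mean

# Two-sample KS test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(training_feature, production_feature)

P_VALUE_THRESHOLD = 0.01  # hypothetical; tuned per feature in practice
if p_value < P_VALUE_THRESHOLD:
    print(f"Possible drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```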
The tool overview: tracking
After outlining the issues, it’s time to see how they might be tackled. It’s important to note that different problems will be solved with different frameworks: as of the time of writing, there are no truly standard ways to manage the entire ML lifecycle, and some tools will be more suitable for a specific problem. Also, due to the apparent divergence between typical software engineering and machine learning, the issues can be alleviated, but a few concerns won’t be addressed.
Starting with experiment tracking, MLflow is one of the most popular tools at the moment. MLflow manages the end-to-end ML lifecycle, from experimentation and tracking to model packaging and deployment. MLflow can track and log experiments, including code, data, models, and metrics, which makes it easy to reproduce and compare different runs.
Model management is another key aspect of MLflow. It provides a centralized model registry for storing, versioning, and managing models. MLflow also supports model packaging and serving, enabling seamless deployment of models to various serving environments, such as Docker containers or the cloud. In addition, MLflow provides tools for monitoring model performance, detecting drift, and triggering retraining or updating models when necessary.
While MLflow offers a comprehensive set of features, there are some potential limitations to consider. For instance, while MLflow provides model versioning and tracking capabilities, it may not offer the same level of granularity and flexibility as dedicated data versioning tools like DVC (Data Version Control). Similarly, while MLflow supports deployment to various serving environments, it may not provide out-of-the-box solutions for more complex deployment scenarios, such as multi-model deployments or advanced scaling and load balancing requirements.
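As a minimal illustration of the tracking part described above, logging a run with MLflow might look like the sketch below; the experiment name, hyperparameters, and dataset are made up for the example:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 5}  # hypothetical hyperparameters

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Log the hyperparameters, the metric, and the model itself for later comparison.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```

Each logged run can then be browsed and compared against others in the MLflow tracking UI.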
A minor note on data versioning
Keeping track of the data is important, but storing the datasets in Git is unrealistic due to their volume. DVC (Data Version Control) is a framework providing version control, reproducibility, and collaboration tailored for data and models. DVC enables versioning, tracking, and sharing of data and models alongside the codebase. This is achieved by creating lightweight metadata files that store information about the data and models without duplicating the actual content; the metadata files are version-controlled using Git.
One of the key features of DVC is its ability to handle large datasets and models that may not be suitable for traditional version control systems like Git. DVC provides a remote storage mechanism that allows users to store and retrieve data and models from various storage backends, such as cloud storage services (e.g., S3), network-attached storage (NAS), or remote servers. This ensures that data and models can be easily shared and accessed.
DVC also addresses the issue of reproducibility by capturing the entire pipeline, including data preprocessing, model training, and evaluation steps. This is achieved through the use of DVC pipelines, which are defined as sequences of commands or scripts that can be executed and tracked by DVC.
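For example, once a dataset is tracked by DVC and pushed to a remote, the DVC Python API can read a specific version of it by Git revision. In the sketch below, the repository URL, file path, and tag are hypothetical, and access to the configured DVC remote is assumed:

```python
import dvc.api
import pandas as pd

# Hypothetical repository, path, and Git tag; adjust to your own project.
with dvc.api.open(
    path="data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.0",
    mode="r",
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```

Changing the rev value then gives you exactly the data snapshot that a particular model version was trained on.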
Conclusion
As a result, you are now familiar with:
How a regular software system compares to one that has an ML model integrated;
ML versioning involves many artefacts: not only the code, but also byproducts such as the datasets and the hyperparameters;
Models tend to fail silently and require monitoring across multiple dimensions (such as data drift, performance degradation, and schema changes);
There are a few tools available for tracking the models and the data, with MLflow and DVC being among the most prominent ones at the moment.