There is an old legend that goes like this: 'Once you've evaluated your trained model on a hold-out set, obtained pretty good metric values, and managed to deploy it, your job is done.' Unfortunately, this legend is not particularly truthful, for multiple reasons. The training happened on clean, historical, labeled data; the evaluation was done without much concern for the actual time it takes to feed the model some input and get a prediction back; there were no external integrations, and so on. These conditions hold during training, but they are pretty rare in production.
In this topic, we will consider how running ML models in production differs from developing a model during the research phase and look at several foundational concepts of ML system design.
The setup
Suppose we have an API that loads a pre-trained model (with a test accuracy above 97%) from a file and serves predictions. For clarity, suppose it's an online object detector, such as a YOLO variant, that receives data from a CCTV camera, draws bounding boxes, and relays the annotated footage back without any significant delay. After a few hours, we notice several things: first, the relayed footage lags behind real time by around five minutes. Then, we notice that not all objects are being detected and assigned their own bounding box.
This is a synthetic scenario, but variations of such issues are quite common. But wasn't the test accuracy above 97%? Sure, but certain aspects might have gone unnoticed during the design phase. You look at the model closely and realize two things: the training set only contained data labeled on sunny days, and the model has not been compressed in any way (for example, pruned or quantized). The first explains the missed objects and bounding boxes; the second is responsible for the huge delays.
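For instance, the latency part could have been mitigated by compressing the model before deployment. Below is a rough sketch of dynamic quantization with PyTorch on a placeholder network; a real object detector would need more careful, model-specific compression.

```python
# A rough sketch: dynamic quantization of a (placeholder) PyTorch model.
import torch
import torch.nn as nn

# Placeholder model standing in for the real detector.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Replace Linear layers with int8 quantized versions to shrink the model
# and speed up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```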
This illustrates a major difference between the training and inference setups. During the training phase, you want fast training and high accuracy on the hold-out set, and you train on a stable (not shifting in time) data distribution. In production, the priorities change: now you want fast, low-latency inference, and your data is shifting (meaning that the model itself is likely fine - it has just never encountered the production data).
In the next section, we will consider two settings for serving predictions, online and batch, and will also look at streaming and batch features.
The time axis: batch and real-time
Online prediction is when predictions are returned as soon as the request with the input is processed by the model. During online prediction, requests are sent to the prediction service via APIs. When prediction requests are sent as HTTP requests, online prediction is also known as synchronous prediction: predictions are generated in sync with requests.
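To make this concrete, here is a minimal sketch of an online prediction service built with FastAPI; the model file and the request schema are hypothetical.

```python
# A minimal sketch of an online (synchronous) prediction service.
# Assumes a hypothetical scikit-learn model saved as model.pkl.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, not per request

class PredictionRequest(BaseModel):
    features: List[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    # The prediction is computed and returned within the same HTTP request.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```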
Batch prediction is when predictions are generated periodically (some delay is acceptable in this setting). The predictions are stored and retrieved upon request. Batch prediction is also known as asynchronous prediction: predictions are generated asynchronously with requests.
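A batch prediction job, in contrast, might look like the sketch below: it is meant to run on a schedule, score everything at once, and store the results for later retrieval. The file and column names are made up.

```python
# A minimal sketch of a batch (asynchronous) prediction job,
# meant to be triggered periodically, e.g. by a cron job or an orchestrator.
import joblib
import pandas as pd

def run_batch_predictions():
    model = joblib.load("model.pkl")                   # hypothetical pre-trained model
    users = pd.read_parquet("user_features.parquet")   # hypothetical feature table

    # Score all rows at once; no user is waiting for these results right now.
    users["prediction"] = model.predict(users[["feature_a", "feature_b"]])

    # Persist predictions so they can be looked up instantly when requested.
    users[["user_id", "prediction"]].to_parquet("predictions.parquet")

if __name__ == "__main__":
    run_batch_predictions()
```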
When designing an ML system, you have to decide whether online or batch prediction should be used.
Something similar applies to features: batch features are computed from historical data or from data that is rarely updated, while streaming features are updated on a much shorter time frame (a classic example is road traffic estimation, where the data might change every five minutes or less).
Batch prediction uses only batch features, while online prediction can combine both streaming and batch features.
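As a toy illustration of that last point, an online prediction request might combine a precomputed batch feature with a streaming feature derived from the last few minutes of events; all names and numbers below are hypothetical.

```python
# Hypothetical example: combining a batch feature with a streaming feature
# at online prediction time.
from datetime import datetime, timedelta

# Batch feature: recomputed once a day from historical data.
avg_trip_time_by_route = {"route_42": 18.5}  # minutes, from a nightly job

# Streaming feature source: events that arrived in the last few minutes.
recent_trip_events = [
    {"route": "route_42", "duration": 25.0, "ts": datetime.now()},
    {"route": "route_42", "duration": 27.5, "ts": datetime.now()},
]

def build_feature_vector(route: str) -> list:
    batch_feature = avg_trip_time_by_route[route]
    cutoff = datetime.now() - timedelta(minutes=5)
    recent = [e["duration"] for e in recent_trip_events
              if e["route"] == route and e["ts"] >= cutoff]
    # Fall back to the batch value if no fresh events are available.
    streaming_feature = sum(recent) / len(recent) if recent else batch_feature
    return [batch_feature, streaming_feature]

print(build_feature_vector("route_42"))  # e.g. [18.5, 26.25]
```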
Changes in the distributions
The second apparent issue in our synthetic system was the mismatch between the training data and the data encountered in production. This is generally referred to as data drift: the underlying distribution of the input features has shifted. It is not the only type of drift that can occur.
Once data drift has been detected, the usual remedy is to retrain the model on new data so that it can learn the new distribution and re-adjust itself. Retraining can be periodic, happen on some trigger, or be done manually.
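One simple way to detect drift in a single numeric feature is a two-sample Kolmogorov-Smirnov test comparing the training and production distributions; the significance threshold and the retraining hook in the sketch below are illustrative assumptions.

```python
# A minimal drift check for one numeric feature using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(train_values, production_values, alpha=0.01):
    # A small p-value suggests the two samples come from different distributions.
    statistic, p_value = ks_2samp(train_values, production_values)
    return p_value < alpha

# Synthetic stand-ins for an image brightness feature.
rng = np.random.default_rng(0)
train_brightness = rng.normal(loc=0.8, scale=0.10, size=5_000)  # "sunny" training data
prod_brightness = rng.normal(loc=0.4, scale=0.15, size=5_000)   # darker production footage

if drift_detected(train_brightness, prod_brightness):
    print("Data drift detected: trigger retraining on fresh labeled data.")
```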
Still, it's system design
Up until this point, we have ignored the regular SWE concerns. Introducing an ML model into the system means you can no longer version only the code (with git, for example): you also have to track experiments (different retrains of the model), store hyperparameters and model-related artifacts (model binaries, data files), and record model metrics (or proxy metrics, which are indirect measures of model performance in production, where ground-truth labels are often unavailable). When it comes to ML-specific versioning, popular options are DVC and MLflow, which simplify the whole versioning process for you.
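For example, a single retraining run could be recorded with MLflow roughly like this; the experiment name, parameters, and artifact path are placeholders.

```python
# A minimal sketch of tracking one training run with MLflow.
import mlflow

mlflow.set_experiment("cctv-object-detector")

with mlflow.start_run():
    # Hyperparameters for this particular retrain.
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)

    # Metrics measured on the hold-out set (later, proxy metrics from production).
    mlflow.log_metric("holdout_map", 0.61)

    # Model-related artifacts: binaries, configs, label maps, etc.
    mlflow.log_artifact("weights/best.pt")
```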
Another aspect is how the intermediary data is stored. Typically, one goes with CSV during model training (because it's the most widely supported format). But there are actually two categories of formats: row-major (such as CSV) and column-major (e.g., Parquet). Row-major formats are more suitable for write-heavy workloads, while column-major formats are much faster for column-based reads (that is, reading entire feature columns directly).
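The difference shows up directly in the I/O calls: with a column-major format like Parquet you can load only the columns you need, whereas a CSV reader still has to scan every row. Here is a small pandas sketch (file names are hypothetical; writing Parquet requires pyarrow or fastparquet).

```python
# Row-major vs column-major storage of the same DataFrame with pandas.
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1_000_000),
    "feature_a": 0.5,
    "feature_b": 1.5,
    "label": 0,
})

df.to_csv("data.csv", index=False)  # row-major: appending rows is cheap
df.to_parquet("data.parquet")       # column-major: stored column by column

# Reading only the feature columns: Parquet can skip the rest of the file,
# while the CSV reader still has to parse every row.
features = pd.read_parquet("data.parquet", columns=["feature_a", "feature_b"])
same_features = pd.read_csv("data.csv", usecols=["feature_a", "feature_b"])
```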
Conclusion
To sum up, you are now familiar with the basic settings of ML system design and how it differs from regular system design:
ML models often perform well during training but face challenges in production due to different priorities: training focuses on accuracy while production requires fast inference and handling shifting data distributions.
Online prediction serves results immediately via APIs using both batch and streaming features, while batch prediction generates results periodically with acceptable delays.
Data drift occurs when production data differs from training data (like a model trained only on sunny days failing in other conditions), requiring retraining on new data or other adjustments.
ML systems need specialized version control for experiments, hyperparameters, and model artifacts, with considerations for appropriate data formats (row-major for writes, column-major for reads).