In the regular ML model training setting, one usually trains the model on the available labeled data, does some hyperparameter tuning on the validation set, and evaluates the final accuracy on a hold-out test set. Sounds very convenient, until the model hits production and starts to run on real-world data. One of the problems with production models is that there are no more ground truth labels to quickly check the performance. Also, other issues, such as the production data not being similar to the data the model was trained on, might arise. These two issues, among many others, will result in some sort of performance degradation.
In this topic, we will look at the common types of drift that occur in an ML system after the initial deployment.
Data and prediction drifts
Note that this topic only provides an outline of the different drifts: their definitions might not be the same across different sources, and on top of that, the drifts described below might occur at the same time.
The first thing we will look at is the change in the input feature distribution, known as data drift (sometimes also called feature drift). Formally, it can be described as a change in P(X), where P(X) is the input feature distribution. It occurs when the model was trained on a set with a specific feature distribution but starts to encounter differently distributed features during inference, so it fails to generalize to the new (previously unseen) data, resulting in a performance drop.
The source of data drift might not be straightforward to determine. For example, let's say you have a pretty good music recommender system that keeps the user on the streaming app ('pretty good' here means that users tend to listen to the recommendations in full). Then, one day, the streaming app adds podcasts to the platform and gets an influx of new users who are interested in the podcast part of the service. The model was trained on the 'regular' music data, so it does not work with podcasts yet. And that is how data drift is introduced, although this is a rather simple example.
Another type of drift is the training-serving skew, which is similar to data drift but has a much shorter time frame of occurrence: the model shows terrible performance right after it has been deployed, making it resemble the classic 'overfit on the training set, fail on the test set' scenario. It might be caused by errors in data preprocessing or feature engineering, or by time delays (e.g., you are doing high-frequency trading, but the APIs lag and inference takes too long, causing a lag in predictions and ultimately leading to degraded performance).
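As a rough illustration of how the preprocessing-related part of this skew is often avoided, here is a minimal sketch that packages the preprocessing and the model into a single scikit-learn Pipeline, which is serialized once and reused as-is at serving time. The data is synthetic and the file name is an arbitrary choice, not taken from any specific system.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1_000, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Training time: the scaler and the model are fitted and serialized together,
# so the serving code cannot re-implement (and silently change) the scaling.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")

# Serving time: load the same artifact and predict on raw, unscaled features.
served = joblib.load("model.joblib")
X_production = rng.normal(loc=0.2, size=(5, 3))  # slightly shifted inputs
print(served.predict(X_production))
```

The point of this setup is that exactly the same transformation code runs in both places, so a mismatch between training-time and serving-time feature engineering cannot creep in.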
Prediction drift, on the other hand, refers to a change in the output distribution, P(Y). It might or might not co-occur with data drift. Prediction drift is a bit tricky as an indicator of model performance. Consider a fraud detection case: let's say you see a sharp spike in the number of detected fraud cases. The prediction distribution has certainly changed, but is it because the model has started to output many false positives (due to either data drift or some model-related error), or because of an actual increase in fraud attempts (for example, it's the holiday season, and scammers are more active than usual)? In both cases, it's important to have at least an alert system so that you can inspect what might be going wrong.
Concept drift
Concept drift is a change in the relationship between the input and the output (the patterns are changing), that is, a change in P(Y|X). Concept drift does not automatically imply data drift (that is, a change in P(X)), but they often happen at the same time. Let's say that we are predicting housing prices in a certain area and trained a model on historical data collected at a certain interest rate. Then, the interest rate goes up (typically, this lowers demand, and housing prices drop). This is an external factor that affected the price: the distribution of the houses themselves, P(X), remained the same, while the conditional distribution P(Y|X) has changed. Basically, an outside factor has influenced the underlying patterns.
Concept drift is often seasonal: the sales of a specific product during a certain time of the year, or plane ticket prices (where different pricing schemes apply on weekdays and weekends). Sometimes, separate models are maintained for specific seasonal shifts.
Detecting and addressing the drifts
As we mentioned in the introduction, labels might (and most likely will) not be available in the production setting, as opposed to the training setting. So, the question is: how do we actually determine that drift has occurred? Through proxy metrics.
First, you can monitor the input and output distributions for changes, because their reference values are inferable from the training set, and set up an automatic alert system that fires when a significant degree of divergence is present, so that you can inspect the model and the artifacts more closely. This means monitoring summary statistics (mean, median, variance, etc.). But there are issues: first, it's very hard to track multi-dimensional features (there are too many of them, and the relationships between them are not that apparent). Second, summary statistics can be very sensitive to outliers. Nonetheless, it's quite a useful way to detect drift.
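As a rough illustration, here is a minimal sketch of summary-statistics monitoring: compare the per-feature means of a production batch against a training-time baseline and flag features that have shifted too much. The synthetic data and the 0.2-standard-deviation threshold are arbitrary choices for illustration.

```python
import numpy as np

def summary_stats(X: np.ndarray) -> dict:
    """Per-feature mean, median and variance."""
    return {
        "mean": X.mean(axis=0),
        "median": np.median(X, axis=0),
        "var": X.var(axis=0),
    }

def drift_alert(train_stats: dict, prod_stats: dict, threshold: float = 0.2) -> np.ndarray:
    """Flag features whose mean shifted by more than `threshold` standard deviations."""
    shift = np.abs(prod_stats["mean"] - train_stats["mean"]) / np.sqrt(train_stats["var"] + 1e-12)
    return shift > threshold

rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, size=(10_000, 4))
X_prod = rng.normal(loc=[0.0, 0.5, 0.0, 0.0], size=(1_000, 4))  # feature 1 has drifted

flags = drift_alert(summary_stats(X_train), summary_stats(X_prod))
print(flags)  # e.g. [False  True False False]
```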
One can do better than summary stats: statistical hypothesis testing or distance metrics, such as the Kolmogorov-Smirnov test (numerical features), the Chi-square test (categorical features), the Population Stability Index (PSI), or KL divergence. We will not go far into these methods, but provide a concise summary of their main attributes below (see the code sketch after the table):
| Method | Feature type | Sensitivity |
| --- | --- | --- |
| Kolmogorov-Smirnov (KS) test | numerical | Sensitive to even slight shifts in the distribution |
| Population Stability Index (PSI) | numerical & categorical | Low outlier sensitivity |
| KL divergence | numerical & categorical | Low outlier sensitivity |
| Jensen-Shannon divergence | numerical & categorical | Only slightly more sensitive than PSI and KL divergence |
| Wasserstein distance | numerical | Medium sensitivity |
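As a rough illustration of the first two rows, here is a minimal sketch that runs the two-sample KS test via SciPy and computes PSI by hand on synthetic data. The 10-bin setup and the clipping constant are illustrative choices, not part of any standard.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a production sample."""
    # Shared bin edges covering both samples.
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by (or log of) zero in empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(7)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time feature
production = rng.normal(loc=0.3, scale=1.2, size=5_000)  # shifted at inference time

statistic, p_value = ks_2samp(reference, production)
print(f"KS p-value: {p_value:.4f}, PSI: {psi(reference, production):.3f}")
```

A commonly cited rule of thumb treats a PSI above roughly 0.2 as a noticeable shift, but any threshold should be validated on your own data.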
Another proxy metric is the model's confidence scores: if the average confidence of the predictions drops noticeably compared to the training-time baseline, the model is likely encountering data it is less familiar with.
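A minimal sketch of what such confidence monitoring could look like, assuming a classifier that outputs class probabilities; the baseline value, the 10% drop threshold, and the fake softmax outputs are illustrative assumptions.

```python
import numpy as np

def mean_confidence(probabilities: np.ndarray) -> float:
    """Average of the top class probability over a batch of predictions."""
    return float(probabilities.max(axis=1).mean())

baseline = 0.91  # assumed value, measured on a held-out set at training time
rng = np.random.default_rng(1)

# Fake production softmax outputs for a 3-class model.
logits = rng.normal(size=(500, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

current = mean_confidence(probs)
if current < 0.9 * baseline:
    print(f"Confidence dropped: {current:.2f} vs baseline {baseline:.2f}")
```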
Now, let's say the drift has been detected. At this point, you can only retrain the model with the added information so that it is able to re-adjust. One can go further and do online learning, where the model is retrained either on a drift trigger or periodically with a certain frequency (e.g., daily or monthly), which we will cover in the next topics.
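To make the idea concrete, here is a minimal sketch of a drift-triggered update loop, assuming the KS test as the detector and an incrementally trainable scikit-learn model. The simulated batches and the step at which drift appears are made up purely for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)

def detect_drift(reference: np.ndarray, batch: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift if any feature fails the two-sample KS test at significance `alpha`."""
    return any(
        ks_2samp(reference[:, j], batch[:, j]).pvalue < alpha
        for j in range(reference.shape[1])
    )

# Initial training data and model (synthetic).
X_ref = rng.normal(size=(2_000, 2))
y_ref = (X_ref[:, 0] > 0).astype(int)
model = SGDClassifier().fit(X_ref, y_ref)

for step in range(5):
    shift = 0.0 if step < 3 else 1.5               # drift appears at step 3
    X_batch = rng.normal(loc=shift, size=(500, 2))
    y_batch = (X_batch[:, 0] > shift).astype(int)  # in practice, labels arrive with a delay
    if detect_drift(X_ref, X_batch):
        # Incrementally update the model on the newly labeled batch
        # and move the reference window forward.
        model.partial_fit(X_batch, y_batch)
        X_ref = X_batch
        print(f"step {step}: drift detected, model updated")
```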
A comparative table of different drifts
| Drift | Short description | Possible actions to solve the drift |
| --- | --- | --- |
| Data drift | The change in the distribution of the input features | Retrain the model on data that reflects the new input distribution |
| Concept drift | The change in the relationship between the inputs and the outputs | Retrain on recent data; consider online learning or separate models for known seasonal shifts |
| Prediction drift | The change in the distribution of the outputs | Investigate whether the cause is a model issue or a real change in the world; retrain if the model is at fault |
| Training-serving skew | Pretty much the same as data drift, but occurs immediately after deployment | Same as for data drift; also check the preprocessing and feature engineering code for training/serving mismatches |
Conclusion
You are now familiar with some of the common types of drift that occur in production and possible ways to detect and address them.