Model monitoring in large-scale applications can be challenging because a high volume of predictions must be analyzed while keeping issue detection time low.
Production machine learning models should be monitored for data and model issues such as data anomalies and drift. I discuss why in my other blog post. The design and properties of a monitoring system depend mainly on the use case, and there are different ways to implement such a system. In this article I discuss two main implementations, which I call the data science approach and the engineering approach. But first I need to introduce a few common concepts.
Model serving is usually implemented as a web server, which exposes a REST API for model inference/prediction. Typically, it loads and runs the model file using the same machine learning library the model was trained with, for instance TensorFlow or PyTorch.
Besides being served through APIs, models can also be run against batches of data in streaming pipelines or scheduled jobs.
In either case, the input and output data for model prediction/inference, e.g. an input image and an output probability, are usually saved to be processed later for model performance analysis, labeling, retraining, etc. This is part of the model development and improvement process.
At the same time, the predictions need to be checked for data validity, issues and anomalies. Any exceptions need to be reported to make sure that neither the data nor the model has operational issues.
The data logging itself can be done in the serving script or, if that is not possible, in the downstream service that requests predictions.
If the model is deployed to production as an API, the monitoring system will see the stream of predictions. Typically, for performance and statistical reasons, streaming data is windowed before processing, for example into few-minute or hourly time windows.
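As a rough illustration of windowing, the sketch below groups logged predictions into fixed, non-overlapping time windows. The 5-minute window size, the function names and the sample records are assumptions made for the example, not part of any particular monitoring system.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # assumed 5-minute tumbling windows


def window_start(timestamp: float, size: int = WINDOW_SECONDS) -> int:
    """Return the start (epoch seconds) of the window containing timestamp."""
    return int(timestamp // size) * size


def group_by_window(records, size: int = WINDOW_SECONDS):
    """Group (timestamp, value) records by the start of their window."""
    windows = defaultdict(list)
    for timestamp, value in records:
        windows[window_start(timestamp, size)].append(value)
    return dict(windows)


# Hypothetical logged predictions: (epoch seconds, output probability)
records = [(1000.0, 0.91), (1100.0, 0.85), (1250.0, 0.40), (1600.0, 0.77)]
windows = group_by_window(records)
```

Each window can then be passed to a detector as one unit, which keeps the per-run cost bounded and gives the statistics enough samples to be meaningful.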
To detect drift and anomalies, detector-specific baselines are necessary. Detection algorithms may require a special type of baseline, for example one consisting of preprocessed statistics or of raw data for some time interval in the past. Some algorithms may even require a set of periodic baselines for subsequent intervals.
Unlike traditional application monitoring, model monitoring focuses on data. One way to understand how data changes and detect failures is to track distributions. Tracking distributions is essential for tabular data, but is also useful for other data such as some properties of text and images.
Data scientists work in notebooks and usually have access to full or partial datasets. A typical flow consists of loading the dataset, processing it (e.g. training a model) and storing the output.
A similar approach can be taken for a monitoring system. A script or even a headless notebook can be run periodically. Assuming that all predictions have been stored when serving, the script loads the predictions for the current time window as well as the necessary baselines (previously saved statistics or raw data). It then applies detection algorithms, e.g. anomaly and drift detectors, stores reports and sends alerts, for example via a Slack webhook.
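A minimal sketch of the detection step in such a periodic job might look like the following. It uses the population stability index (PSI) as one possible drift score. The 0.2 threshold is only a commonly used rule of thumb, and the `check_window` function and its `alert` callback are hypothetical names. A real job would load the stored predictions and baselines and would post the alert to a Slack webhook rather than print it.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population stability index between two samples of a numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def check_window(baseline, current, threshold=0.2, alert=print):
    """Score the current window against the baseline and alert on drift."""
    score = psi(baseline, current)
    drifted = score > threshold
    if drifted:
        alert(f"Drift detected: PSI={score:.3f}")
    return score, drifted


# Hypothetical data: a uniform baseline and a window shifted upwards.
baseline = [i / 100 for i in range(100)]
current = [0.5 + i / 200 for i in range(100)]
score, drifted = check_window(baseline, current)
```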
This approach has no limitations in terms of application of detection algorithms. Some open source anomaly and drift detection libraries are already available.
The biggest challenge with this approach, however, is the amount of data to be processed. If the number of predictions is relatively small, this approach may work fine. Once the amount of data is large enough, processing it in a timely manner becomes an engineering problem.
Quoting the "Introducing MLOps" book:
In terms of challenges, for large-scale machine learning applications, the number of raw event logs generated can be an issue if there are no preprocessing steps in place to filter and aggregate data. For real-time scoring use cases, logging streaming data requires setting up a whole new set of tooling that entails a significant engineering effort to maintain.
I've spent the last decade designing and implementing monitoring and profiling systems for large-scale applications. There was usually too much raw data to apply detection algorithms directly. One possibility is to use streaming algorithms, or sketches, instead of algorithms that operate on raw data. Sketches are a special type of algorithm for computing various statistics iteratively in real time. One important feature of sketches is that they are mergeable, meaning that the statistics for multiple time intervals can be merged to represent larger intervals.
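To make mergeability concrete, here is a small sketch of a streaming summary that tracks count, mean and variance. It is updated one value at a time (Welford's method), and two summaries can be combined with the standard pairwise formula. The `MomentsSketch` class is a hypothetical name for illustration, not an API from any library.

```python
from dataclasses import dataclass


@dataclass
class MomentsSketch:
    """Streaming count/mean/variance summary that can be merged across windows."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's online update: constant memory per value seen.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "MomentsSketch") -> "MomentsSketch":
        # Pairwise combination of two summaries without the raw data.
        n = self.count + other.count
        if n == 0:
            return MomentsSketch()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return MomentsSketch(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of all values seen so far."""
        return self.m2 / self.count if self.count else 0.0


# Two hypothetical time windows summarized separately, then merged.
window_a, window_b = MomentsSketch(), MomentsSketch()
for x in [1.0, 2.0, 3.0]:
    window_a.update(x)
for x in [4.0, 5.0, 6.0, 7.0]:
    window_b.update(x)
merged = window_a.merge(window_b)
```

The merged summary describes the union of both windows, so hourly summaries could be rolled up into daily ones without revisiting any raw predictions.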
One example of a sketch is a quantile sketch. It makes it possible to track the data distribution iteratively for any number of data instances with a very low memory footprint, and to get quantiles and other statistics from the sketch alone. Quantile sketches can further be used to track distribution changes.
This approach solves the big data problem, but it requires more effort to implement. Instead of just saving prediction data in the model serving code, data sketches should be computed for time windows and sent or saved instead. Implementing the detection jobs/scripts may require more work as well.
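The idea can be illustrated with a deliberately simplified stand-in: a fixed-bin histogram over a known value range. This is not a production quantile sketch (real ones adapt to the data and come with formal error bounds), and the `HistogramSketch` class, its default range of 0 to 1 and its bin count are assumptions chosen for the example. It does show the key properties: constant memory, incremental updates, merging, and quantile estimates accurate to about one bin width for in-range values.

```python
class HistogramSketch:
    """Fixed-bin histogram over [lo, hi]; a simplified stand-in for a quantile sketch."""

    def __init__(self, lo=0.0, hi=1.0, bins=100):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.counts = [0] * bins
        self.total = 0

    def update(self, x):
        # Out-of-range values are clamped into the first or last bin.
        idx = int((x - self.lo) / (self.hi - self.lo) * self.bins)
        self.counts[min(max(idx, 0), self.bins - 1)] += 1
        self.total += 1

    def merge(self, other):
        if (self.lo, self.hi, self.bins) != (other.lo, other.hi, other.bins):
            raise ValueError("sketches must share the same bin layout")
        merged = HistogramSketch(self.lo, self.hi, self.bins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        merged.total = self.total + other.total
        return merged

    def quantile(self, q):
        """Approximate the q-quantile as the upper edge of the bin reaching rank q."""
        if self.total == 0:
            raise ValueError("empty sketch")
        target = q * self.total
        width = (self.hi - self.lo) / self.bins
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if c > 0 and seen >= target:
                return self.lo + (i + 1) * width
        return self.hi


# Two hypothetical windows of output probabilities, sketched and merged.
first, second = HistogramSketch(), HistogramSketch()
for i in range(0, 1000, 2):
    first.update(i / 1000)
for i in range(1, 1000, 2):
    second.update(i / 1000)
combined = first.merge(second)
median = combined.quantile(0.5)
```

Comparing quantiles from the current window's sketch with those from a baseline sketch is one simple way to notice that a distribution has shifted.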
The good news is that some out-of-the-box data profiling and monitoring systems already exist and address this exact problem. One such monitoring platform that we've developed is Graphsignal.
When implementing a monitoring system for machine learning models in production, it is important to consider the implications of prediction volume and data size on MTTD (mean time to detect). Operational model monitoring has the goal of detecting production data and model issues as early as possible. Low MTTR (mean time to repair) and adherence to SLAs are critical requirements.