Production machine learning models should be monitored for data changes, anomalies, outliers and drift to detect unexpected behavior.
When a machine learning model is deployed to production, it becomes part of the production application. Its context changes from the training environment to the production stack. As a result, issues in deployed machine learning models need to be addressed in the context of the production application. One important difference between the model training and production environments is their data dependencies.
Datasets for model training can take many forms: local files, S3 buckets, APIs and other sources. The choice depends very much on the use case, which can range from one-time manual training to fully automated retraining. In most cases, however, model prediction/scoring happens in a different environment. For example, a fraud detection model is trained with features loaded and built from a database, but predicts on real-time data in a web application that serves API requests and responds with the probability that the input data is fraudulent.
Dependencies in many applications are typically:
From the model perspective, dependencies can be direct, i.e. implemented by the script performing inference. An example is a database query to fetch data that will be transformed and passed to the model's predict() method. For model servers that serve saved models as an API, dependencies are implemented in downstream components that consume the model server's scoring REST API endpoint. Examples of such model servers are TensorFlow Serving, TorchServe and KFServing.
It is practical to distinguish between code and data dependencies in order to properly set up testing and monitoring.
Code and infrastructure issues have been with us for decades. There are many mature practices and tools to address them, including advanced monitoring and observability solutions that provide log management, metric collection and tracing to detect and troubleshoot such issues.
Data issues in the context of model serving are a relatively new dimension of failures. A challenge with data issues is that, unlike code exceptions, they are hard to detect. Being fully data-driven, models introduce the risk of silent failures: invalid model input data will not cause exceptions, but will lead to garbage model output. I discuss this topic in more detail in my other article Monitoring Machine Learning Models for Silent Failures.
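To make the idea of a silent failure concrete, here is a minimal sketch. The "model" is a hypothetical stand-in (a fixed logistic function with made-up weights), and the feature names and the dollars-versus-cents scenario are assumptions chosen for illustration, not taken from a real system.

```python
import math

# Hypothetical stand-in for a trained fraud model: a fixed logistic function
# over two features. The weights are invented for illustration only.
WEIGHTS = (0.002, -0.01)
BIAS = -1.0

def predict_proba(amount_usd: float, account_age_days: float) -> float:
    """Return a fraud probability for one request."""
    z = BIAS + WEIGHTS[0] * amount_usd + WEIGHTS[1] * account_age_days
    return 1.0 / (1.0 + math.exp(-z))

# Input as the model saw it during training: amount in dollars.
valid = predict_proba(amount_usd=120.0, account_age_days=400.0)

# Assumed scenario: an upstream release starts sending the amount in cents.
# No exception is raised; the model simply returns a very different score.
invalid = predict_proba(amount_usd=12000.0, account_age_days=400.0)

print(f"valid input score:   {valid:.4f}")
print(f"invalid input score: {invalid:.4f}")
```

Both calls succeed, so conventional error monitoring sees nothing wrong; only the input and output values themselves reveal the problem.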
In real-world applications, data dependencies are not constant. They may change, for example, with application releases. Releases fix bugs, add product features, add new services, change configuration, modify data handling and so on. In particular, they may change the model's data dependencies and/or their internal behavior. This is another aspect to be taken into account.
The model development process differs from the software development process. Depending on the model type and use case, model training can be manual or automated, run locally or in a pipeline. In other words, model deployment is subject to the model development schedule rather than the application release cycle.
This means that, from the application perspective, a model can be replaced with a new version independently of the application's CI/CD and releases. While data dependencies may stay unchanged, the model itself changes. Although the model was validated and tested before deployment, the new version can still cause issues with unseen production data.
The evolving nature of production applications introduces a new requirement of constantly monitoring model input and output data for consistency and quality, in order to detect and resolve data issues.
While during training we can run tests against training datasets, doing so becomes challenging for large-scale, online production applications, where models serve prediction requests 24/7. In those cases running dataset tests becomes impractical, because the dataset is actually a continuous stream of data. Depending on how prediction requests are handled, i.e. one at a time or in batches, prediction data has to be grouped into windows so that statistics can be computed.
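Windowing a prediction stream can be sketched as follows. This is a minimal, count-based (tumbling) window, assuming a single numeric value per prediction; the function name `window_stats` and the chosen statistics are illustrative, and a time-based window would work along the same lines.

```python
import statistics
from typing import Dict, Iterable, Iterator, List

def window_stats(values: Iterable[float], window_size: int) -> Iterator[Dict[str, float]]:
    """Yield summary statistics for each full fixed-size (tumbling) window."""
    window: List[float] = []
    for value in values:
        window.append(value)
        if len(window) == window_size:
            yield {
                "count": len(window),
                "mean": statistics.fmean(window),
                "min": min(window),
                "max": max(window),
            }
            window = []  # start the next window; a trailing partial window is dropped

# Hypothetical stream of model output scores arriving one at a time.
stream = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.7, 0.6, 0.5]
stats = list(window_stats(stream, window_size=4))
for s in stats:
    print(s)
```

Because the function consumes an iterable lazily, it can be fed from a live request handler as well as from a batch of already collected predictions.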
To better understand what needs to be represented as a monitorable metric, it is helpful to categorize possible data issues. The following list is in no way complete.
With a goal of detecting and resolving data issues listed above, there are two main capabilities a model monitoring system should provide:
A data monitoring system has two parts: a logger/agent that computes time window or data batch statistics and records samples, and a server part, where those statistics are monitored for issues and visualized.
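The split between the two parts can be illustrated with a small sketch. It assumes one numeric feature, a baseline window recorded earlier, and two simple checks (missing value rate and relative mean shift). All names and thresholds here are hypothetical, and a real system would track more statistics and transport them over the network.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence

@dataclass
class WindowStats:
    count: int
    missing_rate: float
    mean: Optional[float]

def compute_stats(values: Sequence[Optional[float]]) -> WindowStats:
    """Agent side: summarize one window of a single numeric feature."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    missing_rate = 1.0 - len(present) / len(values) if values else 0.0
    mean = sum(present) / len(present) if present else None
    return WindowStats(len(values), missing_rate, mean)

def check_window(current: WindowStats, baseline: WindowStats,
                 max_missing_rate: float = 0.1,
                 max_mean_shift: float = 0.5) -> List[str]:
    """Server side: compare a window against a baseline and list detected issues."""
    issues: List[str] = []
    if current.missing_rate > max_missing_rate:
        issues.append("missing_values")
    if current.mean is None or baseline.mean is None:
        return issues
    scale = abs(baseline.mean) or 1.0  # avoid division by zero
    if abs(current.mean - baseline.mean) / scale > max_mean_shift:
        issues.append("mean_shift")
    return issues

baseline = compute_stats([100.0, 120.0, 80.0, 100.0])
healthy = compute_stats([110.0, 90.0, 105.0, 95.0])
degraded = compute_stats([None, 10000.0, None, 12000.0])

print(check_window(healthy, baseline))
print(check_window(degraded, baseline))
```

Keeping the statistics compact on the agent side means only small summaries, not raw prediction data, need to leave the serving process.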
For some applications it might be good enough to manually implement such a monitoring system based on the ELK stack or Prometheus, along with a logger/agent for computing data statistics. An ML-specific, integrated solution like Graphsignal is another option if an end-to-end, out-of-the-box system is more beneficial.
When setting up monitoring for model serving applications in production, it is important to consider it in the context of the production application environment and not only as part of the machine learning training environment or process.