Machine learning models deployed to production introduce new failure types undetectable by traditional tools.
While a few years ago the main focus of the machine learning community was on training and experimenting with models, today trained models make it to production and often become business critical. This introduces a new set of failure types that are rare when operating traditional applications.
To address these new types of failures, DevOps and MLOps teams have to understand how running an ML model differs from running a web application or service. The key is to start thinking not only in terms of the algorithms (or models, in this case), but also the data they act on. It is true that almost any software processes data in some way; however, the extent to which the control flow is influenced by data is limited. Coded algorithms are sensitive to valid input data only and throw exceptions or ignore the data otherwise. There are cases where algorithms also react to invalid data, but that is a bug rather than normal behavior.
ML models run on data and are sensitive to all input data. The data propagates forward through the model graph and produces prediction/score output that is used, for example, to classify an input. So what would happen if the input data suddenly changed, e.g. one of the inputs contained only zeros due to a new release of an upstream API? In many cases, "nothing" would happen: the model would produce garbage output that the rest of the application would treat as valid. And that's the problem. Many companies report similar issues going undetected for weeks or even months.
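To make this concrete, here is a minimal sketch with a toy numpy stand-in for a trained model (the weights and inputs are made up for illustration, not from any real system). The model happily returns a valid-looking probability even when a feature vector degenerates to all zeros, so nothing downstream signals a failure:

```python
import numpy as np

# Toy stand-in for a trained model: logistic regression with fixed weights.
weights = np.array([0.8, -1.2, 0.5])
bias = 0.1

def predict(features: np.ndarray) -> float:
    """Return a class probability; never raises on degenerate input."""
    logit = float(features @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))

normal_input = np.array([0.9, 0.4, 1.1])
broken_input = np.zeros(3)  # e.g. an upstream API started sending zeros

p_normal = predict(normal_input)
p_broken = predict(broken_input)

# Both calls succeed and return values in [0, 1]; neither raises an
# exception, so only monitoring the input distribution would catch this.
print(p_normal, p_broken)
```

Note that the broken input still yields a plausible score near 0.5, which is exactly why such failures pass through unnoticed.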
Here is a more comprehensive list of possible problems with ML models in production:
I've only listed issues that are operational in nature and should be acted on immediately. There are a number of issues, mostly domain-specific and long-term, that model developers and data scientists should pay attention to as well, for example seasonal data drift, model bias, and model fairness.
Commonly used monitoring tools cover traditional failures only. The most common are:
| Failure type | Monitoring tool |
| --- | --- |
| Exceptions and errors | Log management and error monitoring |
| Capacity issues | Infrastructure monitoring |
| Downtime | Availability monitoring, synthetic monitoring |
In addition to the existing tools, the new stack should also include machine learning model monitoring tools. These should make sure that models operate normally by monitoring at least the following aspects of the deployed models:
In the machine learning and MLOps literature, the term ML monitoring may also include domain-specific model evaluation, seasonal data drift and concept drift, explainability and interpretability, bias and fairness evaluation, adversarial attacks, and so on. In this article I address operational monitoring only, whose purpose is to make sure models operate normally by detecting sudden issues and making model data analysis possible.
Many practitioners implement model monitoring in production using existing tools such as Prometheus and Grafana or similar commercial offerings. These tools are designed for system and performance metric reporting, which requires that the statistics for each feature and/or class be computed before being sent as a generic metric. Some tools support histograms that could theoretically work for reporting feature distributions, but since these histograms are designed for reporting latency (e.g. they have log-spaced bins), they are not suitable for arbitrary values.
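A small numpy sketch illustrates the bucket mismatch. The latency-style buckets below match the Prometheus client's default histogram buckets; the feature values are invented for illustration. Almost all of the feature values fall outside the latency buckets, while data-driven bins capture everything:

```python
import numpy as np

# Latency-style buckets (Prometheus client defaults, in seconds):
# positive only, roughly log-spaced.
latency_buckets = np.array(
    [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10])

# A feature with negative, zero, and large values -- typical for
# arbitrary model inputs, impossible for latencies.
feature = np.array([-3.2, -0.7, 0.0, 0.4, 12.5, 150.0])

# With latency buckets, everything below 0.005 or above 10 is dropped.
counts, _ = np.histogram(feature, bins=latency_buckets)
print(counts.sum(), "of", len(feature), "values landed in latency buckets")

# Bins derived from the data itself cover the feature's actual range.
counts2, edges = np.histogram(feature, bins=10)
print(counts2.sum(), "of", len(feature), "values landed in data-driven bins")
```

Here only one of six values is even countable with latency-style buckets, which is why feature distributions need bins computed from the data rather than fixed log-spaced ones.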
It should also be noted that metric-based dashboards do not support ML model serving semantics, which limits troubleshooting and analysis capabilities.
And finally, prediction instances must be properly windowed before statistics are computed.
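One way windowing can work is a simple count-based tumbling window: buffer prediction instances and emit per-feature statistics only once a full window has accumulated. The class and field names below are illustrative, not from any particular tool:

```python
import numpy as np

class TumblingWindow:
    """Buffer prediction instances; emit per-feature stats per full window."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.buffer = []

    def add(self, features):
        """Add one prediction instance; return stats when the window closes."""
        self.buffer.append(features)
        if len(self.buffer) >= self.window_size:
            batch = np.array(self.buffer)
            self.buffer = []
            # Statistics are computed over the whole window,
            # not per instance, so they are comparable across windows.
            return {
                "mean": batch.mean(axis=0),
                "std": batch.std(axis=0),
                "count": len(batch),
            }
        return None  # window not full yet

window = TumblingWindow(window_size=3)
stats = None
for row in [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]:
    stats = window.add(row)
print(stats)
```

Real implementations would more likely window by time rather than by count, but the principle is the same: statistics describe a well-defined batch of predictions, not a single instance.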
It is good practice, and often a requirement, to log all predictions a model makes, including its input and output data. A more advanced way to store predictions is a feature store.
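A minimal sketch of such prediction logging, written to a JSON-lines stream (the record fields and helper name are illustrative, not a standard format):

```python
import io
import json
import time

def log_prediction(stream, features: dict, output: float, model_version: str):
    """Append one prediction record (input + output) as a JSON line."""
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "output": output,
    }
    stream.write(json.dumps(record) + "\n")

# In production the stream would be a log file or a log shipper;
# an in-memory buffer keeps this sketch self-contained.
buf = io.StringIO()
log_prediction(buf, {"age": 42, "balance": 1250.0}, 0.87, "v3")
print(buf.getvalue())
```

Recording the model version alongside each prediction makes it possible to attribute anomalies to a specific deployment later.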
Prediction logs may be necessary for offline/regular model performance analysis and retraining, although other data sources may be used for retraining too. However, logs might not be very efficient for detecting sudden failures for various reasons:
Creating such pipelines involves setting up a new set of tooling and requires significant effort.
Quoting the "Introducing MLOps" book:
In terms of challenges, for large-scale machine learning applications, the number of raw event logs generated can be an issue if there are no preprocessing steps in place to filter and aggregate data. For real-time scoring use cases, logging streaming data requires setting up a whole new set of tooling that entails a significant engineering effort to maintain.
The good news is that the ecosystem of new tools specializing in data and model monitoring is evolving. A number of new startups are addressing the various challenges of running ML models in production.
With more than a decade of experience building monitoring and profiling tools, while relying intensively on machine learning models ourselves, we've built Graphsignal to help ML engineers, MLOps/DevOps/SRE teams, and data scientists address the operational aspects of running and maintaining models in production.
The Graphsignal logger takes care of precomputing feature and class statistics and reporting them to the Graphsignal cloud for detection and analysis.