Monitoring Model Performance in Production
By Dmitri Melikyan | 2 min read

Monitoring machine learning model accuracy and other performance metrics in production is essential to detect and address data and model issues.

Just like traditional web applications and microservices, model serving and inference jobs should be monitored in production. However, while non-ML applications are mainly monitored for response time and resource utilization, the performance of a machine learning model is measured very differently and is use case specific. In this article I highlight some of the performance metrics for ML models and show how to easily monitor them using Graphsignal.

Ground Truth Availability

Machine learning use cases can be very different. In some cases the prediction made by a model can be validated immediately, for example when a user clicks on a recommended job. In other cases, such as image classification for user-uploaded images, the ground truth only becomes available after the images have been labeled.

There are also situations when ground truth arrives so late that it is not usable for monitoring purposes. In such cases, monitoring the model's inputs and outputs becomes even more critical. I discuss data monitoring in my other article, Model Monitoring in Production: Data Perspective.
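
When labels do arrive with a delay, a common pattern is to keep recent predictions keyed by a request ID and evaluate them once the ground truth shows up. Here is a minimal sketch of that idea; the in-memory store and the record_prediction and record_label names are illustrative assumptions, not part of any particular API:

# Illustrative in-memory store of predictions awaiting ground truth;
# in a real system this would typically be a database or a queue.
pending_predictions = {}

def record_prediction(request_id, prediction):
  # Remember the prediction until its label arrives.
  pending_predictions[request_id] = prediction

def record_label(request_id, label):
  # Join the late-arriving label with the stored prediction.
  prediction = pending_predictions.pop(request_id, None)
  if prediction is not None:
    # At this point the (prediction, label) pair can be logged for
    # evaluation, e.g. with the Graphsignal call shown below.
    print(prediction, label)

record_prediction('req-1', True)
record_label('req-1', False)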

Model Performance Metrics

Depending on the model type, different metrics need to be monitored to make sure that the model isn't broken, e.g. after a deployment, and that the model's input features have the expected values.

Here is a short, but by no means complete, list of metrics for different model types (a quick offline computation sketch follows the list):

  • Classification: accuracy, precision, recall, F1-score
  • Regression: MSE, RMSE, MAE
  • Ranking: MRR, Precision@k
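
To make the metrics concrete, here is a minimal sketch of computing a few of them offline with scikit-learn; the toy labels and predictions are made up for illustration:

from sklearn.metrics import (
  accuracy_score, precision_score, recall_score, f1_score,
  mean_squared_error, mean_absolute_error)

# Toy classification labels and predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print('accuracy:', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall:', recall_score(y_true, y_pred))
print('F1-score:', f1_score(y_true, y_pred))

# Toy regression targets and predictions.
r_true = [2.5, 0.0, 2.1, 7.8]
r_pred = [3.0, -0.5, 2.0, 8.0]

mse = mean_squared_error(r_true, r_pred)
print('MSE:', mse)
print('RMSE:', mse ** 0.5)
print('MAE:', mean_absolute_error(r_true, r_pred))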

Monitoring Model Performance with Graphsignal

Continuous model evaluation with Graphsignal is simple. The label and prediction pair can be conveniently logged whenever and wherever it becomes available. The Graphsignal logger takes care of windowing the logged data, computing statistics and sending them to Graphsignal. It works transparently in both online and batch modes.

import graphsignal

graphsignal.configure(api_key='my_key')
sess = graphsignal.session(deployment_name='my_model_prod')

...

# Log a single prediction together with its ground truth label.
sess.log_evaluation(
  prediction=False,
  label=True)
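
The same call also covers batch scoring jobs: it can simply be invoked once per example. A minimal sketch, continuing with the session created above; the batch lists are made-up illustrations:

# Log evaluations for a whole batch of scored examples.
batch_predictions = [True, False, True]
batch_labels = [True, True, True]

for prediction, label in zip(batch_predictions, batch_labels):
  sess.log_evaluation(
    prediction=prediction,
    label=label)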

See the Quick Start guide and API reference for more information.

The model-specific metrics are then available in the model performance dashboard.

[Screenshot: model performance dashboard]

To monitor and alert on accuracy issues, an alert rule can be set up on the Alerting page.

Monitoring Accuracy for Data Segments

In order to monitor and visualize accuracy per data segment, e.g. age group or country, segments can be provided:

# Attribute this evaluation to data segments, e.g. an age group and a country.
sess.log_evaluation(
  prediction=False,
  label=True,
  segments=['seg1', 'seg2'])

Monitoring segment accuracy may be important for detecting unseen or underrepresented data segments for which model performance is not optimal.

Sign up for a free account to try it out or contact us for a short demo.