Machine Learning Observability and Monitoring

Understanding Machine Learning Observability and Monitoring


Machine Learning Observability becomes important once a machine learning model is deployed in production. A data scientist's primary concern is whether the model's performance persists over time. Is the model still capturing the pattern of incoming data? Is it still performing the way it did during its design phase?

Machine learning model observability refers to tracking and understanding model performance in production from both an operational and a data science perspective. Inadequate monitoring can leave unreliable models unchecked in production, and stale models that no longer add business value never get caught.

There are six distinct phases in an ML model's lifecycle: Model Building, Model Evaluation and Experimentation, Productionize Model, Testing, Deployment, and Monitoring and Observability.

  • Model Building: It deals with understanding the problem, data preparation, feature engineering, and initial code formulation.
  • Model Evaluation and Experimentation: It deals with feature selection, hyperparameter tuning, and comparing different algorithms' effectiveness on the given problem.
  • Productionize Model: It deals with the preparation of the "research" code for deployment.
  • Testing: It deals with ensuring that the production code behaves the way we expect. Its results must match the results obtained in the Model Evaluation and Experimentation phase.
  • Deployment: It refers to getting the model into production to start adding value by serving predictions.
  • Monitoring and Observability: This is the final phase, where we have to ensure that our model is predicting the way we expect it to in production.

In the diagram, notice the cyclical viewpoint. The information collected in the "ML Monitoring & ML Observability" phase feeds back into "Model Building." This means the data collected during the monitoring phase is fed back into the training data for model updates.

Monitoring Scenarios

Monitoring matters in three deployment scenarios. The first scenario is the deployment of a brand new model. The second scenario is where we replace the current model with a completely different model. The third scenario is making small changes to the current live model. For instance, there is a production model and one feature becomes unavailable, so we need to re-deploy that model with the feature removed. Alternatively, we develop a new feature that we expect to be highly predictive, and we want to re-deploy the model with the new feature as an input. In all of these situations, monitoring is how we determine whether the changes made to the model give the desired results.

Why is it hard to monitor ML models?

ML monitoring is hard because, unlike traditional software, the behavior of a machine learning system is not governed only by rules specified in the code. It also depends on the data and on the model learned from that data. Considering the challenges that arise while productionizing a model, the system's behavior needs to be tracked along three dimensions: data, model, and code.

Code & config monitoring takes on additional complexity and sensitivity in an ML system due to:

  • Configuration: Usually, model hyperparameters, versions, and features are controlled in the system config. A slight error there can cause completely different system behavior that won't be caught by traditional software tests. This is especially true in systems where models are constantly iterated on or changed.
  • Entanglement: If an input feature changes, then the importance, weights, or use of the remaining features may also shift. This issue is also known as the "changing anything changes everything" (CACE) principle. Because of it, the feature engineering and feature selection code of an ML system needs to be tested carefully.
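
The configuration risk described above can be reduced by validating the config before a model is served. The following is a minimal sketch, not a definitive implementation: `ModelConfig`, `validate_config`, the feature names, and the allowed hyperparameter range are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical system config for a deployed model (illustrative only)."""
    model_version: str
    features: tuple
    hyperparameters: dict = field(default_factory=dict)


def validate_config(config, trained_features, allowed_ranges):
    """Return a list of problems found; an empty list means the config passed."""
    problems = []

    # The config must list exactly the features the model was trained on.
    missing = set(trained_features) - set(config.features)
    extra = set(config.features) - set(trained_features)
    if missing:
        problems.append(f"features missing from config: {sorted(missing)}")
    if extra:
        problems.append(f"features unknown to the model: {sorted(extra)}")

    # Each checked hyperparameter must be present and inside its allowed range.
    for name, (low, high) in allowed_ranges.items():
        if name not in config.hyperparameters:
            problems.append(f"hyperparameter not set: {name}")
            continue
        value = config.hyperparameters[name]
        if not (low <= value <= high):
            problems.append(f"hyperparameter {name}={value} outside [{low}, {high}]")

    return problems


# Usage with assumed values: one feature is missing and one value is out of range.
config = ModelConfig("2.1.0", ("age", "income"), {"learning_rate": 1.0})
for problem in validate_config(
    config, ["age", "income", "tenure"], {"learning_rate": (1e-4, 0.5)}
):
    print(problem)
```

A check like this only catches structural mistakes in the config. It does not detect entanglement effects, which still require re-evaluating the model after a feature change.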

Why is ML Monitoring required?

As we all know, all production software is prone to failure or degradation over time, so it becomes essential to monitor performance to prevent future issues. In traditional software we typically monitor the software's performance, whereas in the case of machine learning models we also need to monitor the quality of the predictions.
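
One way to monitor prediction quality is to compare accuracy over a recent window of predictions against the accuracy measured before deployment. The sketch below assumes that true labels eventually become available for live predictions, which is not always the case. The class name `RollingAccuracyMonitor`, the window size, and the tolerance are hypothetical choices for illustration.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent predictions (illustrative sketch)."""

    def __init__(self, baseline_accuracy, window_size=500, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured offline
        self.tolerance = tolerance                  # allowed drop before flagging
        self.outcomes = deque(maxlen=window_size)   # True/False per prediction

    def record(self, prediction, actual):
        """Store whether a prediction matched the label that arrived later."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        """True when windowed accuracy falls below baseline minus tolerance."""
        current = self.accuracy()
        if current is None:
            return False
        return current < self.baseline_accuracy - self.tolerance


# Usage with made-up predictions and labels.
monitor = RollingAccuracyMonitor(baseline_accuracy=0.90, window_size=4)
for prediction, actual in [(1, 1), (0, 1), (1, 0), (1, 1)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.is_degraded())
```

Accuracy is used here only because it is simple. The same structure works with any metric that suits the problem, such as precision or mean absolute error.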

The degradation of a model over time is known as model drift or model decay. A model's predictive ability can degrade for various reasons, such as:

Data Skews

Data skew occurs when the machine learning model is trained on data that does not represent the live data. In other words, the data used to train the model locally does not represent the data in the live system on which the model has to generate insights.
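
A skew between training and live data can be quantified by comparing the distribution of a feature in both datasets. The article does not prescribe a metric, so the sketch below uses the Population Stability Index (PSI) as one common choice. The function names, the equal-width binning, and the smoothing constant `epsilon` are assumptions made for illustration.

```python
import math


def bin_fractions(values, edges, epsilon=1e-6):
    """Fraction of values per bin; empty bins are floored at epsilon."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside the training range fall into the first or last bin.
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(values)
    return [max(c / total, epsilon) for c in counts]


def population_stability_index(train_values, live_values, n_bins=10):
    """PSI of one numeric feature, with bins derived from the training data."""
    low, high = min(train_values), max(train_values)
    if low == high:
        raise ValueError("training values are constant; cannot build bins")
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    expected = bin_fractions(train_values, edges)
    actual = bin_fractions(live_values, edges)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Usage with synthetic data: the live feature is shifted upward by 50.
train = list(range(100))
live = [x + 50 for x in range(100)]
print(population_stability_index(train, train))  # identical samples
print(population_stability_index(train, live))   # shifted samples
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but suitable thresholds depend on the feature and the use case.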

There are various reasons why data skew can happen:

  • Training data is designed incorrectly: The distributions of the training data variables are different from the distributions of the real data variables.
  • Non-availability of feature in production: Sometimes a feature that the model was trained on is not available at the production level. This means that we have to remove the feature, replace it with some other variable available in production, or re-create it by combining other production features.
  • Mismatch in Research/Live data: The data on which the model is trained in the research environment comes from one source, and the real data comes from a different source. Even though the pipeline returns the same prediction for the same input data, different data sources may lead to inherently different values in the same features, resulting in different predictions.
  • Data Dependencies: Machine learning models usually use variables that are created or stored by other systems. If the system that generates the data changes, or the code implementation of a feature changes, the feature will produce slightly different results, or its definition might change altogether. For instance, an external system may adjust the voting age from 21 to 18. If voting age is an important feature in the model, this will change its overall predictions.
  • Upstream data changes: Changes in upstream data refer to operational data changes in the data pipeline. In practice, the data science team usually doesn't control every system that input data comes from. For instance, a completely siloed data engineering team changes a variable from Fahrenheit to Celsius even though the rest of the data is still stored in Fahrenheit. This slight change in one variable can skew an overall result, for example if one tries to compute the average temperature.
  • Data Integrity: Data inconsistencies can often go unnoticed in deployed AI systems. Data science teams need to detect feature errors like missing values, type mismatch, and range anomalies to reduce overall issue resolution time.
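
The feature errors named in the last item (missing values, type mismatches, and range anomalies) can be checked per record before it reaches the model. This is a minimal sketch under stated assumptions: the function `check_record`, the schema layout, and the feature names and ranges are hypothetical and not taken from any particular library.

```python
def check_record(record, schema):
    """Return integrity issues for one input record.

    schema maps a feature name to (expected_type, min_value, max_value),
    where expected_type is a single Python type.
    """
    issues = []
    for name, (expected_type, low, high) in schema.items():
        value = record.get(name)
        if value is None:
            issues.append(f"{name}: missing value")
        elif not isinstance(value, expected_type):
            issues.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not (low <= value <= high):
            issues.append(f"{name}: {value} outside [{low}, {high}]")
    return issues


# Hypothetical schema: feature -> (type, minimum, maximum).
schema = {
    "age": (int, 0, 120),
    "temperature_f": (float, -60.0, 140.0),
}

# Usage: "age" arrives as a string and the temperature is out of range.
for issue in check_record({"age": "34", "temperature_f": 400.0}, schema):
    print(issue)
```

Note that a range check alone would not catch every unit change: a Celsius reading such as 21.5 still falls inside a plausible Fahrenheit range, so distribution-level checks are needed alongside per-record checks.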

A Holistic Strategy

Hopefully, this article gives you a much clearer idea of what monitoring for machine learning means and why it matters. We advise following the steps below: