A guide to monitoring, diagnosing, and debugging AI models.

Once you have developed and/or deployed an AI-based system, you will want to monitor it to ensure that it is performing as expected, detect any issues that arise, and trace the source of any problems that you encounter.

In this section, we will discuss the topics of monitoring, diagnostics, and debugging, which are all highly related but distinct concepts.

If you have any questions, comments, or would like expert help in setting up a model diagnostics pipeline for your own applications, please feel free to get in touch!


Monitoring is the process of collecting data about the performance of your system and analyzing it to identify any regressions in performance.

In general, you will want to monitor any performance metrics that you decided were important in designing the system evaluation (such as accuracy, latency, or throughput).
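As a minimal sketch of what "identifying regressions" can look like in practice, the snippet below compares current metric values against baselines recorded during system evaluation. The `MetricSpec` class, the metric names, the numbers, and the 5% tolerance are all assumptions made for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class MetricSpec:
    name: str
    baseline: float          # value measured during system evaluation
    higher_is_better: bool   # True for accuracy/throughput, False for latency
    tolerance: float = 0.05  # allowed relative degradation (assumed default)


def find_regressions(specs, current):
    """Return the names of metrics that degraded by more than their tolerance."""
    regressed = []
    for spec in specs:
        value = current[spec.name]
        if spec.higher_is_better:
            degraded = value < spec.baseline * (1 - spec.tolerance)
        else:
            degraded = value > spec.baseline * (1 + spec.tolerance)
        if degraded:
            regressed.append(spec.name)
    return regressed


# Hypothetical baselines and current measurements.
specs = [
    MetricSpec("accuracy", baseline=0.90, higher_is_better=True),
    MetricSpec("latency_ms", baseline=120.0, higher_is_better=False),
    MetricSpec("throughput_rps", baseline=50.0, higher_is_better=True),
]
current = {"accuracy": 0.84, "latency_ms": 123.0, "throughput_rps": 51.0}
print(find_regressions(specs, current))
```

A relative tolerance keeps small fluctuations from triggering alerts; in a real deployment the threshold would likely be tuned per metric.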

Real-time Monitoring of AI Performance

In order to monitor the performance of a deployed AI system in real-time, you will need to collect a variety of data.

The easier metrics to collect are latency and throughput, which can be collected by monitoring the performance of the system itself. For example, if you are using a microservice architecture, you can monitor the latency and throughput of each microservice call.
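To make this concrete, here is a small sketch of a wrapper that times each call to a service and reports throughput and a latency percentile over a sliding window. The `CallMonitor` class and `fake_model` function are hypothetical names introduced for this example; a production setup would more likely export these numbers to an existing monitoring system.

```python
import math
import time
from collections import deque


class CallMonitor:
    """Keeps (finish_time, duration) for calls made within a sliding window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls = deque()

    def record(self, func, *args, **kwargs):
        """Call func, record how long it took, and return its result."""
        start = self.clock()
        try:
            return func(*args, **kwargs)
        finally:
            end = self.clock()
            self.calls.append((end, end - start))
            self._drop_old(end)

    def _drop_old(self, now):
        while self.calls and self.calls[0][0] < now - self.window_seconds:
            self.calls.popleft()

    def throughput(self):
        """Calls per second, averaged over the full window length."""
        self._drop_old(self.clock())
        return len(self.calls) / self.window_seconds

    def latency_percentile(self, pct):
        """Nearest-rank percentile of call duration in seconds."""
        durations = sorted(d for _, d in self.calls)
        if not durations:
            return None
        rank = max(1, math.ceil(pct / 100 * len(durations)))
        return durations[rank - 1]


def fake_model(text):
    # Stand-in for a real model or microservice call.
    return len(text)


monitor = CallMonitor(window_seconds=60.0)
for query in ["hello", "world", "monitoring"]:
    monitor.record(fake_model, query)
print("p95 latency (s):", monitor.latency_percentile(95))
print("throughput (calls/s):", monitor.throughput())
```

Reporting a high percentile rather than the mean is a common choice because a few slow calls can dominate the user experience while barely moving the average.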

The more difficult metrics to collect are metrics regarding the accuracy of the system. These metrics are more difficult to collect because they require a gold-standard dataset, which is often not available in real-time. However, there are a few ways to get around this problem.

The first way is to collect a gold-standard dataset and use it to evaluate the system periodically. This is a good solution if you have the resources to repeat the evaluation regularly. If this is not feasible, and the AI system is used as part of a larger user-facing system, you can use metrics from the larger system as a proxy. Some easy-to-collect examples are the user click-through rate of recommendations from a recommendation system, the number of times a generated image is downloaded from an image generation system, or overall engagement metrics for a website.
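Both approaches can be sketched in a few lines. The functions below compute accuracy on a small gold-standard sample and a click-through rate from logged events. The event format (dicts with a `"type"` key), the toy predictor, and the data are assumptions made for this illustration.

```python
def accuracy_on_gold(predict, gold_examples):
    """Fraction of gold-standard (input, label) pairs predicted correctly."""
    if not gold_examples:
        raise ValueError("gold_examples must not be empty")
    correct = sum(1 for x, y in gold_examples if predict(x) == y)
    return correct / len(gold_examples)


def click_through_rate(events):
    """Clicks divided by impressions, given a list of event dicts."""
    impressions = sum(1 for e in events if e["type"] == "impression")
    clicks = sum(1 for e in events if e["type"] == "click")
    return clicks / impressions if impressions else 0.0


# Hypothetical gold-standard sample and a toy sentiment predictor.
gold = [("good", "pos"), ("bad", "neg"), ("fine", "pos"), ("awful", "neg")]


def toy_sentiment(text):
    return "neg" if text == "bad" else "pos"


# Hypothetical event log from a recommendation system.
events = [
    {"type": "impression"},
    {"type": "impression"},
    {"type": "click"},
    {"type": "impression"},
    {"type": "impression"},
]

print("gold accuracy:", accuracy_on_gold(toy_sentiment, gold))
print("click-through rate:", click_through_rate(events))
```

Proxy metrics such as click-through rate are affected by factors other than model quality, so they are best read as a trend over time rather than as an absolute measure of accuracy.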

AI Monitoring Tools

Monitoring of various performance metrics can be done using standard monitoring dashboards such as Grafana.

However, there are also a number of tools that are specifically designed for monitoring AI systems.


AI systems have colloquially been referred to as “software 2.0”, for their potential impact on future software development. Similar to traditional software development, AI system development also involves a process of debugging, which mainly aims to identify where systems fail in terms of accuracy, fairness, security, and other problems.

AI system debugging is challenging because the AI systems currently in use are usually opaque (for example, deep neural network-based systems are often considered black boxes). As a result, even when a system achieves decent results, it is hard to understand why or where it is working and failing, which makes it more difficult to know what’s wrong with the system and how to improve it further.

An Example of AI Debugging: ExplainaBoard

At Inspired Cognition, we are building a tool called ExplainaBoard that helps with this AI debugging process, and we will use it as an example to illustrate how debugging tools can help with AI system development.

In order to use ExplainaBoard, you generate system predictions using one or more AI systems and upload them to the interface. The interface will then evaluate the predictions and provide a number of different visualizations that can help you understand the behavior, as well as the strengths and weaknesses of the system.

Let’s take a look at an example of how this works. First, you will need to generate predictions using one or more AI systems for whatever task you’re interested in. As an example, we will use the task of named entity recognition, which is the task of identifying named entities such as people, places, or organizations in text. ExplainaBoard, and debugging in general, is most effective when you have multiple system output results to compare, so we prepared two different systems for this task, both tested on the standard [conll2003] dataset:

  • seqlab: a system based on labeling individual words
  • spanpred: a system based on predicting entity spans.

We can then submit these system outputs to ExplainaBoard, select them using the buttons next to them, and click “pairwise analysis” to get a comparison of the two systems. You can see the ExplainaBoard tutorial for more details, but the more important thing from the point of view of this guide is what we can learn from using AI debugging tools such as this.

First, an AI debugging tool will provide you with a holistic view of the system performance, and we can see this analysis from ExplainaBoard below. We can see that spanpred_conll achieves somewhat higher scores than seqlab_conll.

[Image: Holistic Performance Analysis]

However, this doesn’t give us a whole lot of actionable information that would allow us to improve the system, so ExplainaBoard also provides a breakdown of some insights into where one system may be doing better than the other. Some examples of insights that ExplainaBoard provides in this particular case are:

  • Although spanpred_conll achieves better overall performance, it performs worse on the buckets ‘LOC’ and ‘MISC’ of feature span_true_label.
  • Although spanpred_conll achieves better overall performance, it performs worse on the buckets ‘(3.0, 3.0)’, ‘(4.0, 6.0)’ of feature span_length.

In other words, the span prediction model is doing better overall, but it is performing worse on entities that correspond to locations or other miscellaneous entity types, and it is also performing worse on entities that are 3 words or longer. We can see a further breakdown of these insights in the following images:

[Image: Subsegmented Performance Analysis]
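The idea behind this kind of bucketed comparison can be sketched in plain Python. The code below is not ExplainaBoard's implementation. It is a simplified illustration that uses recall over gold spans instead of F1, with made-up spans and hypothetical function names, to show how a system that is better overall can still be worse on particular buckets.

```python
from collections import defaultdict


def bucket_recall(gold_spans, predicted, bucket_of):
    """Recall of gold spans per bucket.

    gold_spans: list of (sentence_id, start, end, label), end exclusive
    predicted:  set of tuples in the same format produced by a system
    bucket_of:  function mapping a gold span to a bucket name
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for span in gold_spans:
        bucket = bucket_of(span)
        total[bucket] += 1
        found[bucket] += int(span in predicted)
    return {b: found[b] / total[b] for b in total}


def length_bucket(span):
    """Bucket a span by its length in words."""
    _, start, end, _ = span
    length = end - start
    return "1" if length == 1 else "2" if length == 2 else "3+"


def worse_buckets(gold_spans, system_a, system_b, bucket_of):
    """Buckets where system_a has lower recall than system_b."""
    a = bucket_recall(gold_spans, system_a, bucket_of)
    b = bucket_recall(gold_spans, system_b, bucket_of)
    return sorted(k for k in a if a[k] < b[k])


# Made-up gold spans and system outputs, for illustration only.
gold = [
    (0, 0, 1, "PER"),
    (0, 3, 5, "ORG"),
    (1, 0, 3, "LOC"),
    (1, 5, 6, "PER"),
    (2, 2, 6, "MISC"),
]
spanpred = {(0, 0, 1, "PER"), (0, 3, 5, "ORG"), (1, 5, 6, "PER")}
seqlab = {(0, 0, 1, "PER"), (1, 0, 3, "LOC")}

print("by length:", worse_buckets(gold, spanpred, seqlab, length_bucket))
print("by label:", worse_buckets(gold, spanpred, seqlab, lambda s: s[3]))
```

In this toy data, `spanpred` finds more gold spans overall, yet the breakdown shows it trailing `seqlab` on the longest spans and on one label, which is the kind of insight a single aggregate score hides.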

There is also other functionality such as viewing and comparing confusion matrices, drilling down into individual examples, etc.

But the most important thing about AI debugging tools such as ExplainaBoard is that they allow you to move beyond a single performance metric and make it possible to dig deeper, forming the basis for iterative system improvement. For instance, in the case above, because our more performant system is failing at location and miscellaneous entities or longer spans, we could consider new data creation, modeling and training methods that would target these problem areas.

Debugging Capabilities

There are a number of capabilities that a debugging tool may provide for you:

  • Holistic Performance Analysis: A holistic view of the performance of the system according to task-appropriate metrics.
  • Subsegmented Performance Analysis: A breakdown of the performance of the system across different subsegments of the data.
  • Prediction Attribution: Visualization that shows how the system is making its predictions.
  • Prediction Statistics: Visualization that shows various characteristics of the predictions, such as the label distribution for predicted labels in classification tasks.
  • Counterfactual Evaluation: Visualizations that show how the system would perform if the input was manipulated in some way.
  • Multi-system Comparison: Visualization that shows how systems compare to other systems.
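As an illustration of one capability from this list, counterfactual evaluation, the sketch below measures how much accuracy drops when every input is perturbed. The toy predictor, the examples, and the choice of lowercasing as the perturbation are assumptions made for this example.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) pairs predicted correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def counterfactual_gap(predict, examples, perturb):
    """Accuracy on original inputs minus accuracy on perturbed inputs."""
    perturbed = [(perturb(x), y) for x, y in examples]
    return accuracy(predict, examples) - accuracy(predict, perturbed)


def toy_predict(token):
    # Hypothetical toy "system": tags a token as an entity if it is capitalized.
    return "ENTITY" if token[:1].isupper() else "O"


examples = [
    ("Paris", "ENTITY"),
    ("walked", "O"),
    ("Google", "ENTITY"),
    ("the", "O"),
]
gap = counterfactual_gap(toy_predict, examples, str.lower)
print(f"accuracy drop under lowercasing: {gap:.2f}")
```

A large gap suggests the system relies heavily on the manipulated feature (here, capitalization), which may point to a robustness problem worth investigating.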

Also, because AI is subdivided into a very large number of tasks, different tools are applicable to different tasks; many work mainly for classification, while others also work for NLP or vision tasks.

Debugging tools

There are a number of tools that can help with AI debugging. Below is an (incomplete) list of some of the tools that are available, along with their coverage of the capabilities listed above.

Tool                               Tasks                            Holistic  Subsegmented  Attribution  Statistics  Counterfactual  Multi-system
ExplainaBoard                      Classification, Regression, NLP
TruEra                             Classification, Regression
Responsible AI Toolbox             Classification, Regression
Learning Interpretability Toolkit  NLP

Next Steps

Once you have used monitoring or debugging techniques to identify potential issues with your model, you can then use the insights obtained to perform iterative system improvement.