Monitoring Generative AI with Critique

This page describes how you can use Critique to monitor an AI system and ensure that it continues to perform as expected.

What is AI Monitoring?

AI monitoring is the process of observing the outputs of an AI system to ensure that it is performing as expected. Some common use cases are:

  • Monitoring quality changes: Identifying when the system performance is improving (e.g. due to upgrades) or degrading (e.g. due to bugs or data drift).
  • Detecting underperforming user segments: Identifying when the system produces lower-quality outputs for a particular segment of users or inputs (see the sketch at the end of this page).

System Monitoring

In order to perform this monitoring, you need a quality criterion to monitor the system against. For more standard machine learning tasks such as classification or regression, the simplest approach is to measure accuracy or error against a test set.

However, for generative AI systems, defining this “accuracy” is more difficult, as there is no single clear way to assess free-form generated outputs such as text or images. This is where Critique comes in. You can use any of Critique’s assessment criteria to define the quality of the system’s outputs, choosing the one most appropriate for your task.
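
For instance, a single call to Critique’s evaluate method with the bart_score metric (the same metric and configuration used in the full example below) looks like this; the article and summary strings here are made-up placeholders:

from inspiredco.critique import Critique
import os

client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])

# Score one article/summary pair with BARTScore (source-to-target variety)
evaluation = client.evaluate(
    metric="bart_score",
    config={
        "variety": "source_to_target",
        "model": "facebook/bart-large-cnn",
        "language": "eng",
    },
    dataset=[
        {
            "source": "Full text of the article goes here ...",    # placeholder input
            "target": "A short generated summary goes here ...",   # placeholder output
        }
    ],
)
print(evaluation["overall"]["value"])  # dataset-level (here, single-example) score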

Code Example for System Monitoring

The following code shows how you can use Critique to monitor the performance of a running system, which we’ll call ai_system, over time. Specifically, we take the example of a system that generates summaries of articles: we randomly sample a small fraction of the articles being summarized, calculate the predicted quality of each batch of 100 sampled articles, and report the batch’s average quality. If this value decreases over time, either gradually or sharply, that could be a sign of an issue worth investigating further.

from inspiredco.critique import Critique
from datetime import datetime
import os
import random

# Create a Critique client and desired configuration
client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])
metric = "bart_score"
config = {
    "variety": "source_to_target",
    "model": "facebook/bart-large-cnn",
    "language": "eng"
}

articles_to_check = []
while True:

    # Receive a new article from whatever source we are monitoring
    article = get_new_article()
    summary = ai_system.summarize(article)

    # Randomly pick 1% of the articles to check
    if random.random() < 0.01:
        articles_to_check.append({"source": article, "target": summary})

    # Once we have 100 articles to check, evaluate them and report the average quality
    if len(articles_to_check) == 100:
        evaluation = client.evaluate(
            metric=metric,
            config=config,
            dataset=articles_to_check,
        )
        print(f"Average score at {datetime.now()}: {evaluation['overall']['value']}")
        # Reset the batch so the next 100 sampled articles are evaluated separately
        articles_to_check = []