A guide to evaluating AI systems.

Once you have formalized your problem, but before you actually start building a solution, you will want to think about how to evaluate how well your system is doing. There are two main ways to do so: automatic metrics and human evaluation, which we detail below.

If you have any further questions about your particular use case, suggestions about AI guide content, etc., please feel free to get in touch at any time!

Automatic Evaluation Paradigm

The most common way to evaluate the performance of an AI system is to use an automatic evaluation paradigm. This is where you have a set of system outputs along with gold-standard outputs that you can use to evaluate the performance of your system.

[Figure: AI Evaluation]

In setting up such an evaluation framework, you will mainly need to think of two things: what metrics you will use to evaluate your system, and how you will collect the gold-standard dataset (which will be covered in the following AI data creation section).

Automatic Evaluation Metrics

The metrics that you use to evaluate the performance of your system will depend highly on the task that you’re tackling, and specifically the type of output that you’re expecting.

For example, let’s say that we are trying to build a system that can identify spam posts on a social networking site. Some common metrics for this task include:

  • Accuracy: the percentage of overall predictions that are correct, including both “spam” and “not-spam” predictions
  • Precision: the percentage of posts that are actually spam, among all the posts that the system labeled as “spam”
  • Recall: the percentage of actual spam posts in the dataset that the system correctly labeled as “spam”
  • F1 score: a combination of the precision and recall scores, specifically the harmonic mean of the two values

Each of these metrics has a different function. Accuracy tells you how well you are doing overall, but if only a small percentage of your data is spam, then you may be able to get a high accuracy score by simply predicting “not spam” for everything. In this case, precision and recall are more useful, as they tell you how many false positives (“not-spam” posts classified as “spam”) and false negatives (“spam” posts classified as “not spam”) you are getting, respectively. F1 score combines the two, and is often used as a single metric to evaluate the performance of a classification system.
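
To make these definitions concrete, here is a minimal sketch of computing all four metrics with scikit-learn; the labels below are toy data invented purely for illustration.

```python
# Toy illustration of accuracy, precision, recall, and F1 using scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = "spam", 0 = "not spam"; these labels are made up for illustration.
gold_labels      = [1, 0, 0, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(gold_labels, predicted_labels))
print("Precision:", precision_score(gold_labels, predicted_labels))
print("Recall:   ", recall_score(gold_labels, predicted_labels))
print("F1:       ", f1_score(gold_labels, predicted_labels))
```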

However, these metrics mainly work for classification tasks where the output is a single label. Other types of tasks call for different metrics:

  • Regression: For regression tasks, where the output is a continuous value, it is common to use “mean squared error” (MSE) or “mean absolute error” (MAE) as the metrics.
  • Ranking: For ranking tasks, where the output is a list of items, it is common to use “mean reciprocal rank” (MRR) or “normalized discounted cumulative gain” (NDCG) as the metrics (a short sketch of MSE, MAE, and MRR follows this list).
  • Text Generation: For text generation tasks, where the output is a sequence of words, it is common to use metrics such as BLEU or ROUGE that compare the output against one or more gold-standard outputs.
  • Image Generation: For image generation tasks, where the output is an image, similar reference-based metrics can be used.
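
As a rough illustration of the regression and ranking metrics above, here is a minimal sketch using scikit-learn for MSE and MAE, plus a hand-written reciprocal-rank calculation; all of the numbers are toy data.

```python
# Toy illustration of regression (MSE/MAE) and ranking (MRR) metrics.
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Regression: predicted vs. gold continuous values (made-up numbers).
gold = [3.0, 1.5, 4.0]
pred = [2.5, 1.0, 4.5]
print("MSE:", mean_squared_error(gold, pred))
print("MAE:", mean_absolute_error(gold, pred))

# Ranking: for each query, the (1-indexed) rank at which the first
# relevant item appeared in the system's ranked list.
ranks_of_first_relevant = [1, 3, 2]
mrr = sum(1.0 / r for r in ranks_of_first_relevant) / len(ranks_of_first_relevant)
print("MRR:", mrr)
```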

Obviously this is an incomplete list – there are many other AI tasks and many other metrics that can be used to evaluate even the tasks noted above. Feel free to either do your own research on the most appropriate metrics, or get in touch with us and we can suggest some metrics for you as well!

Automatic Metrics for Generative Tasks

In particular, for tasks that involve generating complex outputs such as text or images, evaluation is particularly difficult because there is rarely a single “correct” output. For instance, if you were to translate the sentence “生成されたテキストの評価は難しいです” from Japanese to English, there are many possible correct outputs, such as “Evaluating generated text is difficult”, “Evaluation of generated text is hard”, “Text generation evaluation is challenging”, etc. Similarly, any number of images could be generated from the same input, and it is difficult to say which one is “better” than the others.
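
To see why this is hard for reference-based metrics, here is a small sketch using BLEU (via the sacrebleu package, assumed to be installed): because BLEU scores n-gram overlap with the reference, an equally valid rewording of the translation can receive a much lower score than a hypothesis that happens to share the reference’s wording.

```python
# Comparing two valid translations against the same reference with BLEU.
# The point here is that surface overlap, not meaning, drives the score.
import sacrebleu

references = [["Evaluating generated text is difficult."]]

close_wording   = ["Evaluating generated text is difficult."]
valid_rewording = ["Text generation evaluation is challenging."]

print(sacrebleu.corpus_bleu(close_wording, references).score)
print(sacrebleu.corpus_bleu(valid_rewording, references).score)
```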

In these cases, the quality evaluation itself becomes a complex task, and sometimes you even need an AI system to perform auditing and quality control of the AI output itself! To help solve this problem, we at Inspired Cognition have created a simple API-based toolkit called Critique that can be used to perform quality evaluation for these sorts of generative tasks, for purposes such as output filtering, system monitoring, and human-in-the-loop evaluation.

If this sounds like something that you might be interested in, click over to the Critique docs page to learn more, and get in touch with us with any questions!

Success Measures Other than Prediction Performance

One other important thing to note is that the metrics above focus mainly on the prediction performance of your system; there are other factors that you should also consider when evaluating it.

  • Latency: How long does it take for your system to make a prediction? This is especially important when you need to make predictions in real time (see the short timing sketch after this list).
  • Computational Cost: How much memory or CPU/GPU power does it take to run your system? Often this computational cost will translate to a cost in terms of money if you need to build out server infrastructure or hire cloud computing services when you deploy your system.
  • Fairness: How well does your system perform on different groups of people? This is important, as you will usually want to make sure that your system is performing fairly for different groups of people.
  • Robustness: How well does your system perform on data that is different from the training data? In practical situations this is paramount, as you will want to make sure that your system’s performance doesn’t drastically reduce when it is exposed to unseen data.
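
As one example of a non-prediction measure, latency can be checked directly by timing your system on a batch of inputs. The sketch below assumes a hypothetical predict function and list of examples; reporting the median and 95th-percentile latency is usually more informative than the mean, since tail latency matters most for real-time use.

```python
# A minimal sketch of measuring prediction latency.
# `predict` and `examples` are hypothetical placeholders for your own system.
import statistics
import time

def measure_latency(predict, examples):
    timings = []
    for example in examples:
        start = time.perf_counter()
        predict(example)
        timings.append(time.perf_counter() - start)
    # Report median and 95th-percentile latency (simple index-based
    # percentile approximation), in seconds.
    timings.sort()
    median = statistics.median(timings)
    p95 = timings[int(0.95 * (len(timings) - 1))]
    return median, p95

# Example usage with a dummy predictor:
# median_s, p95_s = measure_latency(lambda x: x, list(range(1000)))
```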

Human-in-the-loop Evaluation Paradigm

For many applications, simply testing your models on a static test set is not enough. For instance, if you are building a dialogue system, you may have a static test set to measure certain components of the system (such as its ability to recognize the user’s intent, or its ability to predict the next utterance), but you will also want to evaluate the system as a whole. In this case, you will need to get the system to interact with real users, and measure the overall performance of the system. Some other examples include:

  • Human evaluation of the subjective quality of the generated text or images.
  • A/B testing of the AI model as a component in a system interacting with real users.

These sorts of evaluations are important, but they are time consuming and need to be repeated every time you develop a new version of your system. Because of this, it may only be feasible to perform them for major updates of your system.
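
If you do run an A/B test, it is worth checking whether the difference you observe between the two variants is statistically meaningful rather than noise. Below is a small sketch using a chi-squared test from SciPy; the counts are invented purely for illustration.

```python
# Chi-squared test on A/B test results (all counts below are invented).
from scipy.stats import chi2_contingency

# Rows: variant A and variant B; columns: successes and failures
# (e.g. interactions rated "helpful" vs. "not helpful" by users).
contingency_table = [
    [420, 580],  # variant A: 420 successes out of 1000 interactions
    [470, 530],  # variant B: 470 successes out of 1000 interactions
]

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print("p-value:", p_value)
# A small p-value (e.g. < 0.05) suggests the difference between the
# variants is unlikely to be due to chance alone.
```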

Next Steps

Once you have an idea of how you will evaluate your system, the next step is to think about creating data for testing and possibly training your model.