A guide to iterative improvement of AI systems.

Once you have a first draft of a system working in production, you may be dissatisfied with its performance in some way. Perhaps its predictions are wrong too often, its latency is too high, or its memory footprint is too large.

In general, you will need to choose whether to go in small steps, medium hops, or big jumps. The bigger the step you take, the more effort it will cost, but the greater the potential reward.

We list a number of potential steps forward below; which one you choose depends on several factors:

  • What are the biggest problems right now? First you need to know what issues your system has. And every AI system has at least some issues! Here model debugging methods are your friend, as they will allow you to systematically examine the issues with your models.
  • How important is the problem? Even if your system is imperfect, it may be “good enough” already, achieving its purpose without any major embarrassment for the team. If this is the case, then “doing nothing” is also a perfectly acceptable course of action.
  • What is your time/cost budget? The answer for how ambitious of a goal to set will differ greatly depending on whether you’re a small startup with a few busy full-time software engineers and no AI experts, or a big organization with several AI research scientists.

This process is one of the trickiest because it varies so much on a case-by-case basis. Use your best judgment, but we also encourage you to describe your scenario to us, and the Inspired Cognition team would be happy to help out!

The following is not an exhaustive list, but it is ordered roughly by increasing difficulty and manual-labor cost on the part of the system designer.

Writing Pre/post-processing Rules

It is possible that your data is not in a format that is conducive to the model you are using, particularly when there is a mismatch between the input and output formats. In this case, pre- or post-processing, transformations that turn the data into a format that is more suitable for the model, can be very helpful.

In particular, pre-processing can be used effectively as part of the iterative model refinement process if you identify particular subsets of the data where the model is under-performing but the problem can easily be fixed by simple transformations. To give one example, it is common for NLP models to underperform on text that is in all caps, so a pre-processing rule that converts all-caps text to lowercase can fix these errors. When applying such rules, it is worth verifying that they fix the targeted errors without degrading performance on other inputs.
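As a minimal sketch, such a rule could look like the following (the `preprocess` function name and the all-caps heuristic are our own illustrative choices):

```python
def preprocess(text: str) -> str:
    """Normalize inputs the model handles poorly, e.g. all-caps text."""
    # Only lowercase strings that are entirely upper-case; leave
    # normally-cased text untouched so useful casing is preserved.
    if text.isupper():
        return text.lower()
    return text
```

The key design choice is to apply the transformation only to the problematic subset, so well-formed inputs pass through unchanged.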

Hyperparameter Tuning

If you are using a machine learning model, you may want to try adjusting the various parameters that were used in training the model. These can include the type of model used, the choice of pre/post-processing, the model architecture, the type of optimization algorithm, the learning rate, etc.

The most common way to do this is manual hyperparameter tuning, adjusting the parameters by hand to various values based on the intuition of the experimenter. This process takes significant manual effort and expertise, but can also be relatively efficient in the case where the experimenter has a good intuition for the problem and can rely on their past experience. In addition, if there is a particular type of error uncovered in your model debugging, you can try methods that have been shown to be effective for that type of error.

It is also possible to perform automatic hyperparameter tuning using AutoML software. Many cloud providers offer this as a service, such as Amazon SageMaker's automatic model tuning and Google Vertex AI's hyperparameter tuning, and there is also open-source software such as Hyperopt, Optuna, and Vizier. Two caveats about these tools: (1) they often require a large number of runs to be effective, which can result in large computing bills, and (2) they tend to cover only the relatively small design space of hyperparameters that can be easily expressed, which can mean missing other important design decisions.
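If you want a middle ground between manual tuning and a full AutoML service, even a simple random search over a hand-defined space can go a long way. Below is a minimal pure-Python sketch; `train_and_evaluate` is a hypothetical stand-in for your real training run:

```python
import math
import random

def train_and_evaluate(learning_rate, hidden_size):
    # Stand-in for a real training run; returns a validation score.
    # (A made-up score surface that peaks near learning_rate = 1e-3.)
    return hidden_size / 512 - abs(math.log10(learning_rate) + 3)

def random_search(n_trials=20, seed=0):
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform draw
            "hidden_size": rng.choice([128, 256, 512]),
        }
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Sampling the learning rate log-uniformly is the usual choice, since reasonable values span several orders of magnitude.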

Training Data Creation

One of the cheapest and most reliable ways to increase your accuracy is by creating more training data. This can be as easy as sampling more inputs from your production system and sending them to labelers, or using data augmentation techniques, as you likely did when you first created your system.

In particular, once you have an initial system, you can perform selective annotation, in which you label a particular subset of the data where you know the system is underperforming. For example, if in your debugging process you have identified a particular class of inputs where your model is underperforming (such as mis-classifying images with glare or poor lighting), you can create a dataset of only those inputs and send them to labelers.
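A minimal sketch of selective annotation, assuming you can express the weak slice as a predicate over inputs (the `select_for_annotation` name and `is_problematic` predicate are our own placeholders):

```python
def select_for_annotation(inputs, is_problematic, budget=100):
    """Pick up to `budget` production inputs from a known weak slice."""
    # Keep only inputs matching the failure mode found during debugging
    # (e.g. images flagged as low-light), then cap at the labeling budget.
    batch = [x for x in inputs if is_problematic(x)]
    return batch[:budget]
```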

Another option that handles this process more formally is active learning. Active learning is a technique where you use your model to predict the labels of unlabeled data, and then use properties of those predictions to decide which data points to send to labelers. Usually you want to select examples that are both:

  • Uncertain: The model is not confident about its prediction. Selecting uncertain examples ensures that the model gains information through the annotations, and consequently reduces uncertainty.
  • Representative: The example is similar to many other unlabeled examples. This ensures that you are not just picking irrelevant outliers, but rather examples where, if labeled, they will provide information useful to other examples in the unlabeled set.
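These two criteria are often combined by weighting an uncertainty score (e.g. prediction entropy) by a representativeness score (e.g. average similarity to the rest of the unlabeled pool). A minimal sketch, where `predict_probs` and `similarity` are placeholders for your own model and similarity function:

```python
import math

def entropy(probs):
    """Prediction uncertainty: higher = less confident model."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def representativeness(example, pool, similarity):
    """Average similarity of an example to the rest of the pool."""
    others = [x for x in pool if x is not example]
    if not others:
        return 0.0
    return sum(similarity(example, x) for x in others) / len(others)

def select_to_label(pool, predict_probs, similarity, k=10, beta=1.0):
    """Rank unlabeled examples by uncertainty * representativeness**beta."""
    scored = [
        (entropy(predict_probs(x))
         * representativeness(x, pool, similarity) ** beta, x)
        for x in pool
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [x for _, x in scored[:k]]
```

The `beta` exponent trades off the two criteria; `beta=0` reduces to pure uncertainty sampling. Note that the naive averaging above is quadratic in the pool size, so real systems typically approximate it.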

For a more extensive overview of active learning, see this tutorial, and feel free to get in touch with us if you’d like help setting it up for your own organization!

System Combination and Ensembling

One very common and relatively reliable way to improve your system is to combine multiple models together. This is often called ensembling. There are many ways to do this:

  • Averaging: The simplest approach is to average the predictions of multiple models, typically by averaging either the probabilities or the log probabilities of each.
  • Weighted Averaging: You can also weight each model’s predictions by some factor before averaging, for example giving more weight to stronger models.
  • Stacking: Another common approach is stacking, which feeds the predictions of one model as input to another model.
  • Voting: You can also use a voting scheme, taking the majority vote among the models’ predictions.
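The averaging and voting schemes can be sketched in a few lines of plain Python (the function names are our own):

```python
from collections import Counter

def average_probs(prob_lists):
    """Average per-class probabilities from several models."""
    n = len(prob_lists)
    return [sum(class_probs) / n for class_probs in zip(*prob_lists)]

def majority_vote(labels):
    """Return the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]
```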

Ensembling is effective because different models tend to make different mistakes, and combining multiple models together can help smooth out some of the idiosyncrasies of each one. Because ensembling relies on the predictions being different, it is especially effective if the models are trained on different data or use different architectures.

One major disadvantage of ensembling is that it can be computationally expensive to use multiple models when deploying your system.

Model Compression

While most of the techniques above are focused on improving the accuracy of your models, there is also a class of techniques that are focused on improving the computational or memory footprint of models. There are a number of common methods to use:

  • Pruning: Pruning is a technique where you remove some of the weights from your model. This can be done by removing the weights that have the smallest magnitude, or by removing the weights that have the smallest contribution to the overall loss. Pruning can be effective because it can reduce the number of parameters in your model, which can reduce the amount of memory and computation required to run the model.
  • Quantization: Quantization is a technique where you reduce the precision of the weights in your model, usually by reducing the number of bits used to represent each weight (for example, from 32-bit floats to 8-bit integers). This can reduce the amount of memory and computation required to run the model, often with little loss in accuracy.
  • Distillation: Distillation is a technique where you train a smaller model to mimic the predictions of a larger model. Distillation can be effective because it can reduce the number of parameters in your model, and it can also be used to convert an ensembled model into a single one, reducing the speed disadvantages of ensembling mentioned above.
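As an illustration of the idea behind quantization, here is a minimal sketch of uniform quantization to `bits`-bit integers (real toolkits built into deep learning frameworks are far more sophisticated than this):

```python
def quantize(weights, bits=8):
    """Uniformly map float weights onto `bits`-bit integer levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float weights from the quantized form."""
    return [level * scale + lo for level in q]
```

The round trip introduces an error of at most half a quantization step per weight, which is the accuracy cost traded for the smaller representation.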

Experimenting with new Modeling or Training Methods

It is also possible to try out new modeling or training methods beyond the scope of standard hyperparameter search. The distinction between the two is not entirely clear (nor need it be), but in general modifying hyperparameters will not require changing your underlying codebase, whereas experimenting with new methods will. Because of the greater time investment required, and the increased uncertainty of making such changes, it is often best to try this only after you have exhausted the other options above.

It also opens up a reasonably large design space, so it is important to have a good intuition about the problem you are trying to solve, the options available to you, and which of those options are likely to be effective. These intuitions are generally built up with experience, so this path is most recommended for experienced practitioners. Much of the information covered in our AI Model Prototyping guide may be useful here, but once you have an initial system you will need to pay special attention to techniques that are likely to improve on what you already have.

Performing Multi-task Learning

Finally, particularly with neural network models, it can be useful to perform multi-task learning: training a single model to perform multiple tasks on different datasets. This can be effective because it allows you to leverage information from a more abundant dataset to improve performance on another. There are two commonly used paradigms.

The first is where you may train a model to solve multiple supervised tasks at the same time. For example, you may train a model to perform both image classification and image segmentation. This is relatively straightforward if you already have labels for both tasks, and also is relatively intuitive: if you can think of two tasks that seem highly related you can just train a model on both of them and see if it helps.
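A minimal sketch of a joint training step of this kind, where a shared encoder feeds one head per task and the per-task losses are combined with weights (all names here are our own placeholders for your model components):

```python
def multitask_step(shared_encode, heads, losses, batch, weights):
    """Compute one joint loss over a batch covering multiple tasks.

    batch maps task name -> (input, target); heads, losses, and weights
    map task name -> that task's prediction head, loss, and weight.
    """
    total = 0.0
    for task, (x, y) in batch.items():
        features = shared_encode(x)        # shared across all tasks
        prediction = heads[task](features)  # task-specific head
        total += weights[task] * losses[task](prediction, y)
    return total
```

The per-task weights are themselves hyperparameters: if one task's loss dominates, the shared encoder can end up serving that task at the expense of the others.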

The second method is based on self-supervised learning, which is a technique where you train a model to perform a task without any labels. For example, you may train a model to predict the next word in a sentence, or to predict the next frame in a video. The big advantage of this is that you can train a model on a large amount of unlabeled data, possibly even trained directly on inputs similar to ones that you would like to handle in your system.

If you would like to learn more you can start with this tutorial on multi-task learning.