A guide to deploying AI models in production.

Once you have a model that you are confident in, you will need to deploy it in a scalable and robust way to make it accessible to users.

If, when building your AI system, you opted to use a solution that hosts models and provides an API for making predictions, then you can mostly skip this section. However, the information may still be helpful for understanding what is going on behind the scenes, and you may want to switch to a self-hosted solution in the future if the running cost of an API-based service becomes too high.

If you will be deploying your own models, read on! And please feel free to get in touch with any questions or comments, or if you would like expert help on your own model deployments.

Inference for AI Models

The process of making predictions on new data is called inference. Before getting into the actual deployment of models, it is important to understand a few things about how inference works.

Single-example vs. Batch Inference

One important distinction is whether you are doing inference over a single input example at a time or over a batch of examples.

  • Single-example inference makes predictions over a single input example at a time.
  • Batch inference simultaneously processes many input examples by batching them together and processing them all at once.

Single-example inference is the default method for inference, and has several advantages. First is simplicity: one example in, one prediction out. Simple methods are easier to implement and less prone to bugs. Second, given sufficient hardware, latency (the turnaround time for providing a response) will usually be lower for inference over a single example. This is because there is no overhead from waiting for multiple samples to be input and batched together, and also because the actual processing of a single example takes somewhat less time end-to-end.

Batch inference is particularly suited for high-throughput applications where you are processing many examples per second and computational efficiency of the overall system is key. By processing multiple examples at a time, it is possible to take advantage of parallelism provided by modern hardware, where batch processing of multiple examples takes less time overall than single-example inference.

This is illustrated in the figure below.

[Figure: Batch inference]
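
As a concrete sketch of the difference, the snippet below (assuming PyTorch, with a small stand-in network in place of your real model) runs the same 64 inputs through the model one at a time and then as a single stacked batch; on parallel hardware, the batched forward pass is typically the faster of the two.

    # A minimal sketch contrasting single-example and batch inference, assuming PyTorch.
    # The small two-layer network is a stand-in for your real model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    model.eval()  # inference mode: disables dropout and similar training-only behavior

    inputs = [torch.randn(16) for _ in range(64)]  # 64 incoming examples

    with torch.no_grad():
        # Single-example inference: one forward pass per input example.
        single_preds = [model(x.unsqueeze(0)).argmax(dim=-1).item() for x in inputs]

        # Batch inference: stack the inputs and run a single forward pass over all of them.
        batch = torch.stack(inputs)                # shape: (64, 16)
        batch_preds = model(batch).argmax(dim=-1)  # shape: (64,)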

Hardware for Inference

Before going into the actual deployment options, it is worth discussing the hardware on which you will be performing inference. There are several options:

  • CPU (Central Processing Unit) is the standard chip that powers the rest of the computer or device that is being used.
  • GPU (Graphics Processing Unit) is a specialized processor that is optimized for performing many operations in parallel. GPUs were originally developed for graphics processing, but now are widely used in training and inference for neural networks (NNs), where extensive mathematical operations are necessary.
  • NN-Specialized Hardware, such as the Google TPU (Tensor Processing Unit) and AWS Inferentia, refers to processors developed specifically for performing training and/or inference on neural networks. They are available through cloud providers (Google Cloud and AWS respectively), but usually not for on-premise deployment.

There are several important tradeoffs to consider when deciding between these options, as summarized below:

  • Cost: CPUs are by far the least expensive option for very low-throughput applications where the machine is sitting idle for significant periods of time. GPUs and NN-specialized hardware are more suitable for high-throughput applications, and which one is more cost-efficient will depend a lot on the level of throughput and the application.
  • Flexibility: CPUs can be used for any type of computation, but the other hardware is mainly useful for neural network inference (so they are generally not useful for most other models such as linear models or decision trees).
  • Processing Speed: GPUs and specialized chips are generally faster than CPUs for neural network inference, particularly over large batches of data. For single-example inference, the difference is less pronounced. When compared to GPUs, NN-specialized hardware tends to shine on big batches of data with particular types of networks, but it also has less flexibility and software support (see below).
  • Memory: CPUs use the memory of the computer that they are running on, while GPUs and NN-specialized chips have their own memory that is separate from the computer’s memory. This on-device memory is part of what makes processing fast, but it can also be a limitation: you either need a device with enough memory to host the model that you want to serve, or you need to split the model across multiple devices (which is often complicated and requires expertise).
  • Software Support: CPUs are supported by all software packages, and GPUs are supported well by any serious software package for neural networks. Support for TPUs or Inferentia chips is much less ubiquitous – they are supported well by software developed by the chip’s manufacturer (Google and Amazon respectively in this case) but not necessarily as well by other software packages.

One basic takeaway is that if you are using methods other than neural networks (such as decision trees or linear models), then CPUs are the standard option, because there will usually be little to no benefit from the hardware acceleration of GPUs or NN-specific chips. For deploying neural-network-based models you are faced with a choice. For low-throughput applications, CPUs may still be a good bet – you don’t want to be forced to pay a lot of money for a GPU or TPU that is sitting idle most of the time! If you have higher throughput, can do batching, or critically need low latency, then serving on GPUs or NN-specific hardware may be the best choice.
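
One practical pattern that follows from these tradeoffs is to decide on the inference device at startup. The sketch below (again assuming PyTorch; the 2 GB free-memory threshold is an arbitrary placeholder) uses a GPU only if one is present and has enough free memory for the model, and falls back to the CPU otherwise.

    # A minimal sketch of choosing inference hardware at startup, assuming PyTorch.
    # The 2 GB free-memory threshold is an arbitrary placeholder.
    import torch

    def pick_device(min_free_bytes: int = 2 * 1024**3) -> torch.device:
        """Use a GPU if one is available and has enough free memory, otherwise the CPU."""
        if torch.cuda.is_available():
            free_bytes, _total_bytes = torch.cuda.mem_get_info()
            if free_bytes >= min_free_bytes:
                return torch.device("cuda")
        return torch.device("cpu")

    device = pick_device()
    # model = load_your_model().to(device)  # hypothetical loader for your own model
    # inputs = inputs.to(device)            # inputs must be moved to the same device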

It is also worth noting that there are more esoteric options for inference hardware, such as FPGAs, as well as other AI-specialized processors such as Graphcore or Cerebras products. These are not your typical go-to options, but they may be worth exploring if you have a very compute-intensive high-throughput workflow.

Where to Deploy Models

Now that we have discussed inference and hardware options, we can get into the important decision of where to deploy your models. There are several options:

  1. Cloud compute uses machines hosted by a cloud computing provider.
  2. On-premise compute uses machines hosted in your own facilities.
  3. Edge compute runs computation on edge devices used directly by users, such as mobile phones.

Cloud compute is by far the easiest and most popular option. If you are deploying in the cloud, you can either serve models using the standard architecture for deploying cloud services, or use AI-specific solutions.

General Cloud-based Solutions

The most common way to deploy models in the cloud is to use general-purpose cloud infrastructure and serve your models from a web server. This is the same architecture used for deploying any other web service, and it is the most flexible option, but you will need to implement more things on your own.

In order to do this, you can use any of the many cloud computing providers, such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure. Deployment in this case will generally involve:

  • Implementing an API: You will need to implement an API that can be used to send data to your model and receive predictions. This can be done using any web framework, such as Flask, Django, or FastAPI (see the sketch after this list).
  • Building a Container: You will need to build a container image that packages your model together with the code for your API. This can be done using any containerization technology; Docker is the industry standard.
  • Deploying the Container: You will need to deploy the container to a cloud computing provider. Kubernetes is a common option, and most cloud providers also have their own proprietary solutions.
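
To make the first step concrete, here is a minimal sketch of a prediction API using FastAPI. It assumes a scikit-learn-style model with a predict method, serialized to a file named model.pkl; the input schema and loading code are placeholders that you would adjust to match your own model.

    # A minimal sketch of a prediction API, assuming FastAPI and a scikit-learn-style
    # model with a .predict() method serialized to model.pkl (both are placeholders).
    import pickle

    from fastapi import FastAPI
    from pydantic import BaseModel

    with open("model.pkl", "rb") as f:  # placeholder path to your trained model
        model = pickle.load(f)

    app = FastAPI()

    class PredictRequest(BaseModel):
        features: list[float]  # one input example

    class PredictResponse(BaseModel):
        prediction: float

    @app.post("/predict", response_model=PredictResponse)
    def predict(request: PredictRequest) -> PredictResponse:
        # Single-example inference: wrap the features in a batch of size one.
        prediction = model.predict([request.features])[0]
        return PredictResponse(prediction=float(prediction))

    # If this file is saved as main.py, it can be served with:
    #     uvicorn main:app --host 0.0.0.0 --port 8000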

AI-specific Cloud Deployment Solutions

There are also a number of AI-specific solutions that are designed to make it easier to deploy models in the cloud. These solutions are generally more opinionated and have more restrictions, but they also provide more functionality out of the box. In addition, there will usually be a significant increase in running cost due to the intermediary layer. Note that the solutions below are designed for deploying AI models that you have already created yourself; you can also browse our list of API-based AI Solutions if you are looking for something more plug-and-play.

On-premise Computing

On-premise compute can be somewhat less expensive than cloud compute, and is mainly suitable for organizations that have large computational demands and in-house system administration expertise. One argument for on-premise compute has traditionally been privacy protection of sensitive data, but on-premise computing carries its own risks, since most organizations have fewer resources to maintain strict security protocols than cloud providers do.

If you are looking to build your own on-premise compute infrastructure, it is probably easiest to go with a vendor that specializes in building AI-related systems, such as those provided by:

Edge Computing

Edge compute is the only solution in cases where the AI system must work on devices without internet connectivity. Some examples may include speech recognition or translation systems that need to work in sparsely populated areas, or systems for autonomous robots that work in disaster areas. It is also the least expensive solution in terms of hardware cost, as the hardware is provided by the users. However, it comes with the strict constraint that all related software must be able to run within the computational limits of edge hardware.
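
One common way to satisfy these constraints (not the only one) is to export the trained model into a portable format that lightweight on-device runtimes can execute. The sketch below, assuming PyTorch and its built-in ONNX exporter with a small stand-in network in place of your real model, shows the general shape of such an export.

    # A minimal sketch of exporting a model to ONNX for on-device runtimes, assuming
    # PyTorch and its built-in ONNX exporter; the small network is a stand-in model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    model.eval()

    example_input = torch.randn(1, 16)  # a dummy input that defines the expected shape
    torch.onnx.export(model, example_input, "model.onnx")

    # The resulting model.onnx file can be bundled with the application and executed
    # on the device by an ONNX-compatible runtime.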

This type of deployment will require significant software engineering and AI expertise, and should generally only be done when really necessary. If you are looking to learn more about these options, some of the following tools may be helpful:

Next Steps

Once you have a model deployed, you will want to make sure that it is working as expected using AI monitoring and debugging tools, and you may also want to make performance improvements.