A summary of the best methods to create data for testing and training AI systems.

Once you have formally defined the task that you want to tackle and chosen your evaluation, in almost all cases you will want to create a dataset of examples that you can use to test and/or train your system.

Types and Amount of Data to Collect

The data that you collect should be split into several sets:

  1. Test Set: This is the data that will be used to evaluate your system, and it should be as close as possible to the data the system will actually see when deployed. It is also paramount that this data is of high quality.
  2. Training Set: This is the data that will be used to train your system. In general, you will want this set to be larger, and it is OK if it is a bit noisy or a bit farther from the data you will actually be processing.
  3. Development Set: This is a set of data similar to the test set that you can use to tune various system settings, or “hyper-parameters”. This is also often called a “validation set”.
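To make the splits concrete, here is a minimal sketch of carving a single pool of labeled examples into these three sets with scikit-learn (the library choice, the 80/10/10 ratio, and the toy data are illustrative assumptions, not prescriptions):

```python
from sklearn.model_selection import train_test_split

# Hypothetical labeled pool; in practice, load your own examples.
inputs = [f"example input {i}" for i in range(10_000)]
labels = [i % 2 for i in range(10_000)]

# Carve off 20% for evaluation, then split that evenly into
# development and test sets (an 80/10/10 split overall).
train_x, eval_x, train_y, eval_y = train_test_split(
    inputs, labels, test_size=0.2, random_state=0, stratify=labels
)
dev_x, test_x, dev_y, test_y = train_test_split(
    eval_x, eval_y, test_size=0.5, random_state=0, stratify=eval_y
)

print(len(train_x), len(dev_x), len(test_x))  # 8000 1000 1000
```

Holding the random seed fixed keeps the splits reproducible, so the test set stays untouched across experiments.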

One of the first questions you may have is “how much data will I need?” For the test set (and likewise for the development set), the answer is relatively straightforward: you want a set that is large enough, and representative enough, that you can compare the systems you are trying and determine with confidence which one is best. The exact number depends on the task and the systems being evaluated, but as a general rule of thumb, 1,000-5,000 examples is a reasonable test set size to start with. There are also methods such as power analysis that allow you to determine the ideal test set size more rigorously. If your budget doesn’t allow for a test set this large, you can start with a smaller one, but you will need to be careful not to draw strong conclusions from small accuracy differences.
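To give a flavor of what such a power analysis might look like, the sketch below uses statsmodels to estimate the test set size needed to distinguish two systems; the 85% vs. 88% accuracies, significance level, and power are illustrative assumptions:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Suppose system A scores ~85% accuracy and system B ~88%; how many
# examples are needed to reliably detect that 3-point gap?
effect_size = proportion_effectsize(0.88, 0.85)

n = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,  # 5% chance of declaring a difference that isn't real
    power=0.8,   # 80% chance of detecting the difference if it is real
)
print(f"~{n:.0f} examples per system")  # roughly 1,000 in this setup
```

Reassuringly, this lands near the rule of thumb above. Note that it treats the two systems as evaluated on independent samples; a paired comparison on the same test set typically needs fewer examples.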

For the training set, the answer is more complicated. If you are relying on a pre-packaged API-based system, you may need no training data whatsoever and can simply use the API to generate predictions. However, if you are building a system from scratch, you may need quite a large dataset, and in general the more clean data you can acquire, the better the accuracy you can expect. If you are adapting an existing system to a new task, the amount needed usually falls somewhere in between.

Obtaining Clean Data for Testing

For most of the metrics listed above, you will need a set of testing data that includes both inputs and outputs, where the inputs are similar to those you want your system to process, and the outputs are “good” examples of what you would like your system to predict.
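For instance, such a test set is often stored as one JSON object per line, pairing each input with its gold-standard output; the field names and spam-detection labels below are hypothetical, not a required schema:

```python
import json

# Hypothetical input/output pairs for a spam-detection test set.
test_examples = [
    {"input": "Congratulations!!! You won a FREE prize, click here",
     "output": "spam"},
    {"input": "Hey, are we still on for lunch tomorrow?",
     "output": "not_spam"},
]

# Write one JSON object per line (the common "JSONL" convention).
with open("test.jsonl", "w") as f:
    for example in test_examples:
        f.write(json.dumps(example) + "\n")
```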

Using Public Datasets

In some cases there will already be curated datasets that are similar to the use case you are interested in. Which one to choose will depend on your specific use case. If you contact us with a brief description of your use case, we can try to help you find a dataset that suits your needs.
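For example, if your task resembles a well-studied one such as sentiment classification, hubs like Hugging Face Datasets let you pull a curated dataset in a couple of lines; the specific dataset below is just an illustration, so swap in whatever actually matches your use case:

```python
from datasets import load_dataset  # pip install datasets

# Load a public sentiment-classification dataset as an example.
dataset = load_dataset("imdb")

print(dataset["train"][0])       # {'text': ..., 'label': ...}
print(dataset["test"].num_rows)  # size of the ready-made test split
```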

If a public dataset already exists, this is helpful for several reasons:

  1. It is not necessary to spend time and money on data creation.
  2. The evaluation metrics used on the dataset may already be established, which can make the difficult decision of choosing an evaluation method a bit easier.
  3. In most cases, there will already be results reported by systems that others have built, and you can compare your system’s numbers against them to make sure your system is competitive.

However, at the same time there are several caveats:

  1. You should look at the data very carefully to confirm that it matches your final use case sufficiently closely that the results you get on the public dataset will be meaningful.
  2. You should check the dataset’s license to see what you are allowed to use it for. If you are building a commercial product but the license doesn’t allow commercial use, you’ll have to look for other datasets.

Curating a Test Dataset

If no appropriate dataset exists, you’ll need to curate a test dataset yourself. This can be divided into two steps: collecting the test inputs, and annotating the data with whatever outputs you want to predict.

The most important property of the test inputs is that they are representative of the inputs you will actually want to process when the system is running. For example, if you are building a system to identify spam posts on a social networking site, it would be ideal to collect actual posts from that site. In addition, you may want to make sure the posts cover a broad variety of topics and types of users, so that the system functions well across a wide range of usage scenarios.
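One simple way to enforce that kind of coverage is to sample a fixed number of inputs per topic; the sketch below assumes your candidate posts sit in a pandas DataFrame with a topic column (the column name, topics, and sample size are hypothetical):

```python
import pandas as pd

# Hypothetical pool of candidate posts, each tagged with a topic.
posts = pd.DataFrame({
    "text": [f"post {i}" for i in range(9_000)],
    "topic": ["sports", "politics", "shopping"] * 3_000,
})

# Draw 100 posts per topic so no single topic dominates the test set.
test_inputs = posts.groupby("topic").sample(n=100, random_state=0)
print(test_inputs["topic"].value_counts())  # 100 of each topic
```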

Once you have your inputs, you will need to annotate the outputs that you want your model to predict. This can be done in a variety of ways:

  • Crowdsourcing: You can use a crowdsourcing platform to get people to annotate the data. This is a good option if you have a large amount of data, and you want to minimize the cost of annotation. The most widely used crowdsourcing platform for these tasks is Amazon Mechanical Turk.
  • Freelancers: You can hire freelancers to annotate the data. This is a good option if you need a relatively skilled workforce to handle a specialized task. One representative site for this is Upwork.
  • Managed annotation solutions: You can use a managed annotation solution to annotate the data. This is a good option if you don’t want to deal with the overhead of setting up an annotation service. Some annotation services include Appen, Labelbox, and Scale.
  • In-house annotation: You can hire people to annotate the data in-house. This is a good option if you have particularly sensitive data and need to comply with regulations, or only need to annotate a small amount of data not warranting the startup cost of using an external service.
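Whichever option you choose, a common quality check (our suggestion here, not something the options above require) is to have two annotators label an overlapping subset and measure their agreement, for example with Cohen’s kappa:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten posts.
annotator_a = ["spam", "spam", "ok", "ok", "spam", "ok", "ok", "spam", "ok", "ok"]
annotator_b = ["spam", "ok", "ok", "ok", "spam", "ok", "spam", "spam", "ok", "ok"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 mean strong agreement
```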

Obtaining Data for Training

All of the methods mentioned above are also applicable to the acquisition of training data. In addition, when creating data to train your system, you can be a bit more creative. For instance, you may:

  1. Acquire Similar Datasets: Through multi-task learning, a technique where a model is trained to perform multiple tasks at once, you can take advantage of existing datasets that are similar to the one you are interested in. These public datasets can be found through the same process described above for testing data, but you can loosen the selection criteria somewhat, without worrying quite as much about an exact match to your use case.
  2. Synthesize Data: Through data augmentation, it is possible to create new data from existing samples automatically, often by perturbing the inputs or labels in some way, or by swapping in other content. There are a number of toolkits that make this easier, such as Albumentations for image processing, and NLPAug or Snorkel for text processing; a brief sketch using NLPAug follows below.
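As a small sketch of what augmentation looks like in practice with NLPAug (assuming the library and its WordNet dependency are installed; the example sentence is made up):

```python
import nlpaug.augmenter.word as naw  # pip install nlpaug

# Replace a few words with WordNet synonyms to create extra
# training variants of an existing example.
aug = naw.SynonymAug(aug_src="wordnet")

text = "The quick brown fox jumps over the lazy dog"
for variant in aug.augment(text, n=3):  # generate 3 perturbed variants
    print(variant)
```

Augmented examples are best kept in the training set only; the test set should remain real, representative data.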

Next Steps

Once you have data for your system, it’s time to move on to choosing or building your AI solution.