AI System Comparison
This page describes how you can use Critique to compare different AI systems.
Very often, you will have multiple AI systems that you want to compare to each other.
For instance, you might have multiple API providers such as Google, IBM, and OpenAI, which can all be used to translate content on your site. In this case, you would want to compare the accuracy of the different systems to decide which one to use (while also considering other factors such as latency and cost).
You might also be doing AI system development, where you are training multiple models to solve the same problem. In this case, it is important to rigorously compare the different models to know which one is most accurate.
Code Example for System Comparison
In this example, we will assume that you have prepared a test set of inputs and gold-standard outputs created by a human annotator, and want to evaluate multiple AI systems based on these outputs.
Specifically, we will compare the accuracy of outputs from several translation systems that translate from Japanese to English.
We start with a set of Japanese sentences source, the corresponding reference English translations reference, and the system outputs targets.
source = [
    "スタジオ騒然のしくじりエピソードを披露する。",
    "スタジオ一同ドン引き?",
]
reference = [
    "She introduces an episode that shocks the whole studio.",
    "Is the whole studio disgusted?",
]
targets = [
    [
        "The studio is in a state of turmoil and shows an episode.",
        "Studio everyone don pull?",
    ],
    [
        "The studio reveals a failure episode.",
        "Is everyone shocked?",
    ],
]
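Before submitting anything, it can help to confirm that the data is aligned: each system's output list should have the same length as source and reference. This is an optional sanity check, not part of the Critique API:
# Optional sanity check: all lists must be aligned, one entry per input sentence
assert len(source) == len(reference)
for system_outputs in targets:
    assert len(system_outputs) == len(source)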
We will use three metrics for comparison: BLEU, chrF, and BERTScore, so we create a list metrics with the name and configuration of each:
metrics = [
    {"metric": "bleu", "config": {}},
    {"metric": "chrf", "config": {}},
    {"metric": "bert_score", "config": {"model": "bert-base-uncased"}},
]
Next, we make calls to the Critique client to submit the evaluation tasks. You can read the getting started page for more information on how to acquire an API key, and to understand the details of the submit_task() and wait_for_result() methods used below.
import os

from inspiredco.critique import Critique

# Create a Critique client
client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])

# Submit all evaluation tasks to Critique
all_tasks = []
for system_id, single_target in enumerate(targets):
    single_task_ids = []
    # Create a dataset in the format to submit to Critique
    dataset = [
        {"source": s, "references": [r], "target": t}
        for s, r, t in zip(source, reference, single_target)
    ]
    # Submit a task for each metric
    for metric in metrics:
        task_id = client.submit_task(
            metric=metric["metric"],
            config=metric["config"],
            dataset=dataset,
        )
        print(f"Submitted system {system_id+1} and metric {metric['metric']}: {task_id=}")
        single_task_ids.append(task_id)
    all_tasks.append(single_task_ids)

# Fetch results for all tasks
print("Fetching results...")
results = [
    [client.wait_for_result(task_id) for task_id in task_ids] for task_ids in all_tasks
]

# Print the overall score for each system and metric
for system_id, result in enumerate(results):
    for metric, res in zip(metrics, result):
        print(f"System {system_id+1} {metric['metric']}: {res['overall']['value']}")