This page describes how you can use Critique to assess the quality of translations from one language to another.
What is Translation Quality?
Let’s say that you’re using a Translation API to translate the product descriptions on your website from English to Japanese. Simple enough: you put in this text:
High-quality stockings, available in black and pink. Stretchy, and no worries about runs!
And you get this back:
高品質なストッキング、ブラックとピンクの両方があります。 伸縮性があり、走行の心配はありません!
You put the translated text on your website and you’re ready to go. But how do you know whether the translation is good and ready to show to customers, or completely wrong and embarrassing?
In this case, the translation of the first sentence is pretty good, perhaps just a tiny bit less natural than a human translator’s. However, the translation of the second sentence is completely wrong: the word “run” is translated into “走行”, which means “to operate” or “to drive” and has nothing to do with a “run” in a stocking. But this is pretty hard to tell if you don’t know the language, especially at scale!
Critique’s translation quality estimation gives high scores to good translations and low scores to bad ones, helping you with use cases such as filtering, monitoring, and system comparison or improvement.
One important difference in how you estimate translation quality is whether or not you have a reference translation, i.e. one or more translations of the source text created by human translators:
- Reference-free evaluation only requires the input text and the output from a translation system. This is the most common case, such as the example of filtering translation outputs above.
- Reference-based evaluation requires the output from a translation system, one or more reference translations, and sometimes the input text. This is useful when you want to monitor or compare systems, and you can acquire translations from translators to create some references for this purpose.
In general, reference-based evaluation can be more accurate, but because it requires more resources it is less broadly applicable. You can choose which one is best for your use case.
Critique API for Translation Quality
Please first see the Getting Started page for basic instructions on how to use the Critique API; then you can dive into the details of translation quality below.
Reference-free evaluation
For reference-free evaluation, you will want to prepare your data in the following format:
dataset = [
    {
        "source": "High-quality stockings, available in black and pink.",
        "target": "高品質なストッキング、ブラックとピンクの両方があります。",
    },
    {
        "source": "Stretchy, and no worries about runs!",
        "target": "伸縮性があり、走行の心配はありません!",
    },
]
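If your source sentences and system outputs live in parallel lists, a minimal sketch like the following can assemble them into this format (the variable names here are just illustrative):

sources = [
    "High-quality stockings, available in black and pink.",
    "Stretchy, and no worries about runs!",
]
targets = [
    "高品質なストッキング、ブラックとピンクの両方があります。",
    "伸縮性があり、走行の心配はありません!",
]

# Pair each source sentence with the corresponding system output.
dataset = [{"source": s, "target": t} for s, t in zip(sources, targets)]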
Critique has a number of settings that can be used for translation quality estimation, but the currently recommended setting for reference-free evaluation is:
metric = "comet"
config = {"model": "unbabel_comet/wmt21-comet-qe-da"}
You can then call the API to evaluate the outputs:
import os

from inspiredco.critique import Critique

client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])
result = client.evaluate(metric=metric, config=config, dataset=dataset)
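You can then use the per-example scores to filter out suspicious translations, as in the filtering use case above. The sketch below is a hypothetical example: the exact format of the returned result and the score range of each metric are described on the Getting Started page and the individual metric pages, so the "examples" and "value" fields and the cutoff here are assumptions you should adapt:

# Hypothetical sketch: assumes the result exposes one score per example
# under result["examples"][i]["value"]; check the Getting Started page
# for the actual response format.
threshold = 0.1  # illustrative cutoff; calibrate on your own data
for item, scores in zip(dataset, result["examples"]):
    if scores["value"] < threshold:
        print("Possibly bad translation:", item["target"])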
Reference-based evaluation
In reference-based evaluation, you will need to provide references (and sometimes also a source) for each target. For example, you could prepare the following data:
dataset = [
    {
        "source": "High-quality stockings, available in black and pink.",
        "target": "高品質なストッキング、ブラックとピンクの両方があります。",
        "references": ["上品なストッキング、黒とピンクがあります。"],
    },
    {
        "source": "Stretchy, and no worries about runs!",
        "target": "伸縮性があり、走行の心配はありません!",
        "references": ["伸縮性があり、伝線の恐れはありません!"],
    },
]
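Note that "references" takes a list, so you can provide more than one human translation for the same target when you have them. For example (multi_ref_item is a hypothetical name, and the second reference is just an illustrative alternative translation):

multi_ref_item = {
    "source": "High-quality stockings, available in black and pink.",
    "target": "高品質なストッキング、ブラックとピンクの両方があります。",
    "references": [
        "上品なストッキング、黒とピンクがあります。",
        "高品質なストッキング、黒とピンクの2色があります。",
    ],
}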
In this case, you may want to use a different configuration that can take advantage of the references, like the following (note that since the targets and references here are Japanese, you should choose a model that supports Japanese, such as a multilingual one):

metric = "bert_score"
config = {"model": "bert-base-multilingual-cased"}
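The call to evaluate the outputs is then the same as in the reference-free case:

result = client.evaluate(metric=metric, config=config, dataset=dataset)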
Various Configurations for Translation Quality
In addition to the recommended settings above, Critique supports a number of other metrics that can be used for translation quality. See each metric’s page for more details about its configuration options.