Evaluation of machine translation
Various methods for the evaluation of machine translation have been employed. This article focuses on the evaluation of the output of machine translation, rather than on performance or usability evaluation.
Round-trip translation
A typical way for lay people to assess machine translation quality is to translate from a source language to a target language and back to the source language with the same engine. Though this may intuitively seem a good method of evaluation, it has been shown that round-trip translation is a "poor predictor of quality". The reason it is such a poor predictor of quality is reasonably intuitive: a round-trip translation is not testing one system but two, the language pair of the engine translating into the target language and the language pair translating back from the target language.

Consider the following examples of round-trip translation performed from English to Italian and Portuguese, from Somers (2005):
Original text | Select this link to look at our home page.
Translated (Italian) | Selezioni questo collegamento per guardare il nostro Home Page.
Translated back | Selections this connection in order to watch our Home Page.

Original text | Tit for tat
Translated (Portuguese) | Melharuco para o tat
Translated back | Tit for tat
In the first example, where the text is translated into Italian and then back into English, the English text is significantly garbled, but the Italian is a serviceable translation. In the second example, the text translated back into English is perfect, but the Portuguese translation is meaningless.
While round-trip translation may be useful to generate a "surplus of fun," the methodology is deficient for serious study of machine translation quality.
Human evaluation
This section covers two of the large-scale evaluation studies that have had a significant impact on the field: the ALPAC 1966 study and the ARPA study.
Automatic Language Processing Advisory Committee (ALPAC)
One of the constituent parts of the ALPAC report was a study comparing different levels of human translation with machine translation output, using human subjects as judges. The human judges were specially trained for the purpose. The evaluation study compared an MT system translating from Russian into English with human translators, on two variables.
The variables studied were "intelligibility" and "fidelity". Intelligibility was a measure of how "understandable" the sentence was, and was measured on a scale of 1 to 9. Fidelity was a measure of how much information the translated sentence retained compared to the original, and was measured on a scale of 0 to 9. Each point on the scale was associated with a textual description. For example, 3 on the intelligibility scale was described as "Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence".
Intelligibility was measured without reference to the original, while fidelity was measured indirectly. The translated sentence was presented, and after reading it and absorbing the content, the original sentence was presented. The judges were asked to rate the original sentence on informativeness: the more informative the original sentence was judged to be, the more information the translation had failed to convey, and so the lower the quality of the translation.
The study showed that the variables were highly correlated when the human judgment was averaged per sentence. The variation among raters was small, but the researchers recommended that, at the very least, three or four raters should be used. The evaluation methodology managed to separate translations by humans from translations by machines with ease.
The study concluded that, "highly reliable assessments can be made of the quality of human and machine translations".
Advanced Research Projects Agency (ARPA)
As part of the Human Language Technologies Program, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems, and continues to perform evaluations based on this methodology. The evaluation programme was instigated in 1991 and continues to this day. Details of the programme can be found in White et al. (1994) and White (1995).

The evaluation programme involved testing several systems based on different theoretical approaches: statistical, rule-based and human-assisted. A number of methods for the evaluation of the output from these systems were tested in 1992, and the most suitable methods were selected for inclusion in the programmes for subsequent years. The methods were: comprehension evaluation, quality panel evaluation, and evaluation based on adequacy and fluency.
Comprehension evaluation aimed to directly compare systems based on the results of multiple-choice comprehension tests, as in Church et al. (1993). The texts chosen were a set of articles in English on the subject of financial news. These articles were translated by professional translators into a series of language pairs, and then translated back into English using the machine translation systems. It was decided that this was not adequate as a standalone method of comparing systems, and it was abandoned due to issues with the modification of meaning in the process of translating from English.
The idea of quality panel evaluation was to submit translations to a panel of expert native English speakers who were professional translators, and to have them evaluate the translations. The evaluations were done on the basis of a metric modelled on a standard US government metric used to rate human translations. An advantage of this approach was that the metric was "externally motivated", since it was not developed specifically for machine translation. However, the quality panel evaluation was very difficult to set up logistically, as it required having a number of experts together in one place for a week or more, and furthermore required them to reach consensus. This method was also abandoned.
Along with a modified form of the comprehension evaluation (re-styled as informativeness evaluation), the most popular method was to obtain ratings from monolingual judges for segments of a document. The judges were presented with a segment and asked to rate it on two variables, adequacy and fluency. Adequacy is a rating of how much information is transferred between the original and the translation, and fluency is a rating of how good the English is. This technique was found to cover the relevant parts of the quality panel evaluation while being easier to deploy, as it did not require expert judgment.
Measuring systems based on adequacy and fluency, along with informativeness, is now the standard methodology for the ARPA evaluation programme.
Automatic evaluation
In the context of this article, a metric is a measurement. A metric that evaluates machine translation output represents the quality of the output. The quality of a translation is inherently subjective; there is no objective or quantifiable "good". Therefore, any metric must assign quality scores so that they correlate with human judgment of quality. That is, a metric should give high scores to translations that humans score highly, and low scores to those that humans score poorly. Human judgment is the benchmark for assessing automatic metrics, as humans are the end users of any translation output.
The measure of evaluation for metrics is correlation with human judgment. This is generally done at two levels: at the sentence level, where scores are calculated by the metric for a set of translated sentences and then correlated against human judgment for the same sentences, and at the corpus level, where scores over the sentences are aggregated for both human judgments and metric judgments, and these aggregate scores are then correlated. Figures for correlation at the sentence level are rarely reported, although Banerjee et al. (2005) do give correlation figures that show that, at least for their metric, sentence-level correlation is substantially worse than corpus-level correlation.
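As a rough illustration of the two levels described above, the sketch below correlates metric scores with human judgments, first sentence by sentence and then aggregated per system. The scores are hypothetical and plain Pearson correlation is assumed for simplicity; published evaluations may use other correlation coefficients.

```python
# Sketch of sentence-level vs corpus-level correlation between an automatic
# metric and human judgments. All scores below are hypothetical.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-sentence scores for three MT systems on the same test set.
metric_scores = {
    "system_a": [0.41, 0.35, 0.52, 0.28],
    "system_b": [0.33, 0.30, 0.47, 0.25],
    "system_c": [0.22, 0.27, 0.38, 0.19],
}
human_scores = {
    "system_a": [3.8, 3.1, 4.4, 2.7],
    "system_b": [3.0, 2.9, 4.1, 2.4],
    "system_c": [2.1, 2.5, 3.3, 1.9],
}

systems = sorted(metric_scores)

# Sentence level: correlate the two score lists sentence by sentence.
sentence_level = pearson(
    [s for sys in systems for s in metric_scores[sys]],
    [s for sys in systems for s in human_scores[sys]],
)

# Corpus level: aggregate (average) per system first, then correlate.
corpus_level = pearson(
    [mean(metric_scores[sys]) for sys in systems],
    [mean(human_scores[sys]) for sys in systems],
)

print(f"sentence-level r = {sentence_level:.3f}, corpus-level r = {corpus_level:.3f}")
```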
While not widely reported, it has been noted that the genre, or domain, of a text has an effect on the correlation obtained when using metrics. Coughlin (2003) reports that comparing the candidate text against a single reference translation does not adversely affect the correlation of metrics when working with restricted-domain text.
Even if a metric correlates well with human judgment in one study on one corpus, this successful correlation may not carry over to another corpus. Good metric performance across text types or domains is important for the reusability of the metric. A metric that only works for text in a specific domain is useful, but less useful than one that works across many domains, because creating a new metric for every new evaluation or domain is undesirable.
Another important factor in the usefulness of an evaluation metric is to have good correlation even when working with small amounts of data, that is, candidate sentences and reference translations. Turian et al. (2003) point out that "Any MT evaluation measure is less reliable on shorter translations", and show that increasing the amount of data improves the reliability of a metric. However, they add that "... reliability on shorter texts, as short as one sentence or even one phrase, is highly desirable because a reliable MT evaluation measure can greatly accelerate exploratory data analysis".
Banerjee et al. (2005) highlight five attributes that a good automatic metric must possess: correlation, sensitivity, consistency, reliability and generality. Any good metric must correlate highly with human judgment; it must be consistent, giving similar results for the same MT system on similar text; it must be sensitive to differences between MT systems; and it must be reliable, in that MT systems that score similarly should be expected to perform similarly. Finally, the metric must be general, that is, it should work with different text domains and in a wide range of scenarios and MT tasks.
The aim of this subsection is to give an overview of the state of the art in automatic metrics for evaluating machine translation.
BLEU
BLEU was one of the first metrics to report high correlation with human judgments of quality. The metric is currently one of the most popular in the field. The central idea behind the metric is that "the closer a machine translation is to a professional human translation, the better it is". The metric calculates scores for individual segments, generally sentences, and then averages these scores over the whole corpus for a final score. It has been shown to correlate highly with human judgments of quality at the corpus level.
BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than appear in a reference text.
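The clipping that "modifies" simple precision can be sketched as follows. This is a simplified illustration rather than the reference BLEU implementation: the function names and example sentences are my own, the reference-length choice for the brevity penalty is simplified, and no smoothing is applied.

```python
# Simplified sketch of BLEU-style clipped n-gram precision with a brevity
# penalty. Real BLEU is computed over a whole corpus, combines n = 1..4
# precisions as a geometric mean, and picks the reference length closest to
# the candidate length.
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Candidate n-gram counts are clipped to the maximum count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu_like_score(candidate, references, max_n=4):
    """Geometric mean of clipped precisions, times a brevity penalty."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    ref_len = min(len(r) for r in references)        # closest-length choice simplified
    bp = 1.0 if len(candidate) > ref_len else exp(1 - ref_len / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

candidate = "the cat is on the mat".split()
references = ["there is a cat on the mat".split(), "the cat sits on the mat".split()]
print(bleu_like_score(candidate, references, max_n=2))
```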
NIST
The NIST metric is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision, giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given. For example, a correct match of the bigram "on the" receives a lower weight than a correct match of the bigram "interesting calculations", as the latter is less likely to occur. NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
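The information weighting can be sketched as follows. The weight formula used here is the commonly cited one, in which an n-gram's information value is computed from reference-corpus counts of the n-gram and of its (n-1)-gram prefix; the toy corpus is invented for illustration, and the rest of the NIST score (per-order averaging and its brevity penalty) is omitted.

```python
# Sketch of NIST-style information weighting for matched n-grams, using the
# commonly cited formula
#   info(w1..wn) = log2( count(w1..w(n-1)) / count(w1..wn) )
# with counts taken from a reference corpus. Everything else about the full
# NIST score is omitted here.
from collections import Counter
from math import log2

def ngram_counts(corpus_tokens, n):
    return Counter(tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1))

def info_weight(ngram, ref_corpus_tokens):
    """Rarer, less predictable n-grams get higher weight."""
    n = len(ngram)
    full = ngram_counts(ref_corpus_tokens, n)[ngram]
    if full == 0:
        return 0.0
    if n == 1:
        prefix = len(ref_corpus_tokens)   # empty prefix: compare against corpus size
    else:
        prefix = ngram_counts(ref_corpus_tokens, n - 1)[ngram[:-1]]
    return log2(prefix / full)

# Invented reference corpus: the frequent, predictable bigram "on the" gets a
# lower weight than the rarer "interesting calculations".
ref_corpus = ("the interesting results are on the table the interesting sums are "
              "on the page and the interesting calculations are on that last page").split()
print(info_weight(("on", "the"), ref_corpus))                      # ~0.585
print(info_weight(("interesting", "calculations"), ref_corpus))    # ~1.585
```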
Word error rate
The word error rate (WER) is a metric based on the Levenshtein distance; where the Levenshtein distance works at the character level, WER works at the word level. It was originally used for measuring the performance of speech recognition systems, but it is also used in the evaluation of machine translation. The metric is based on the calculation of the number of words that differ between a piece of machine-translated text and a reference translation.
A related metric is the position-independent word error rate (PER), which allows for re-ordering of words and sequences of words between a translated text and a reference translation.
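The two measures can be sketched as follows: WER as word-level Levenshtein distance normalised by the reference length, and a simple bag-of-words variant in the spirit of PER. The example sentences are illustrative, and the PER-style function is a simplification rather than the exact published formulation.

```python
# Sketch of word error rate (WER) as word-level edit distance divided by the
# reference length, plus a simple position-independent variant (PER-like,
# bag-of-words matching).
from collections import Counter

def word_error_rate(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

def position_independent_error_rate(hypothesis, reference):
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    matched = sum((hyp & ref).values())       # word matches regardless of position
    return 1 - matched / sum(ref.values())

reference = "the cat is on the mat"
hypothesis = "on the mat the cat is"
print(word_error_rate(hypothesis, reference))                  # 1.0: re-ordering is penalised
print(position_independent_error_rate(hypothesis, reference))  # 0.0: word order is ignored
```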
METEOR
The METEOR metric is designed to address some of the deficiencies inherent in the BLEU metric. The metric is based on the weighted harmonic mean of unigram precision and unigram recall. It was designed after research by Lavie (2004) into the significance of recall in evaluation metrics, which showed that metrics based on recall consistently achieved higher correlation with human judgment than those based on precision alone, cf. BLEU and NIST.
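A minimal sketch of this core score, assuming exact unigram matching only: precision and recall are combined as a harmonic mean weighted towards recall. The 9:1 weighting follows the original METEOR description (later versions tune these parameters), and the fragmentation penalty and the matching modules described below are omitted.

```python
# Sketch of METEOR's core score: unigram precision and recall over exact
# matches, combined as a recall-weighted harmonic mean. The full metric adds a
# fragmentation penalty and synonym/stemming matching modules, omitted here.
from collections import Counter

def unigram_precision_recall(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum((cand & ref).values())       # exact unigram matches
    precision = matches / sum(cand.values()) if cand else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    return precision, recall

def meteor_fmean(candidate, reference):
    p, r = unigram_precision_recall(candidate, reference)
    if p == 0 or r == 0:
        return 0.0
    return (10 * p * r) / (r + 9 * p)          # harmonic mean weighted towards recall

print(meteor_fmean("the cat sat on the mat", "there is a cat on the mat"))
```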
METEOR also includes some features not found in other metrics, such as synonymy matching: instead of matching only on the exact word form, the metric also matches on synonyms. For example, if the reference contains the word "good" and the translation renders it as "well", this counts as a match. The metric also includes a stemmer, which stems words and matches on the stemmed forms. The implementation of the metric is modular insofar as the algorithms that match words are implemented as modules, and new modules that implement different matching strategies may easily be added.
Further reading
- Machine Translation Archive: Subject Index : Publications after 2000 (see Evaluation subheading)
- Machine Translation Archive: Subject Index : Publications prior to 2000 (see Evaluation subheading)