
BLEU [NLP]

Description

Bilingual Evaluation Understudy (BLEU) is an automated metric that evaluates the quality of machine-translated text by measuring its similarity to one or more high-quality human reference translations. It works by comparing sequences of words (n-grams) in the machine output against the reference texts, producing a score between 0 and 1, where higher values indicate a closer match.

Info

BLEU is widely used due to its low cost and correlation with human judgment, though it does not consider grammatical correctness or semantics.

Strengths:

  • Widely adopted: It is a standard and popular metric in the field of machine translation.
  • Automated and inexpensive: It provides a fast, automated way to evaluate translation quality without human review for every sentence.
  • High correlation: It correlates well with human judgment of quality on average, especially over a large number of sentences.

Limitations:

  • Ignores semantics: BLEU does not consider the meaning of the text, focusing only on word overlap. A translation could have a high score but be semantically incorrect.
  • Grammatical errors: It does not explicitly penalize grammatical errors.
  • Tokenization dependency: Scores are only comparable when candidates and references are tokenized the same way. Early versions left tokenization to the user, which made it hard to compare systems that used different tokenizers; modern implementations such as sacreBLEU standardize this step. The short illustration below shows the effect.
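
The effect is easy to see with a toy pair of sentences. In the sketch below, the sentence pair and the hand-made punctuation-aware token lists are invented purely for illustration: the same candidate/reference pair is scored once with plain whitespace splitting and once with the punctuation split off, and the two scores differ even though the translation itself has not changed.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid zero 4-gram counts on short sentences

reference = "The cat isn't on the mat."
candidate = "The cat is not on the mat."

# Scheme 1: plain whitespace splitting keeps "isn't" and "mat." as single tokens.
score_ws = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

# Scheme 2: punctuation-aware tokenization (done by hand here) splits them apart.
ref_tok = ["The", "cat", "is", "n't", "on", "the", "mat", "."]
cand_tok = ["The", "cat", "is", "not", "on", "the", "mat", "."]
score_tok = sentence_bleu([ref_tok], cand_tok, smoothing_function=smooth)

print(f"whitespace tokens: {score_ws:.3f}")
print(f"punctuation split: {score_tok:.3f}")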

Workflow

  • N-gram comparison: The algorithm compares consecutive phrases, or n-grams, of the machine translation against the human references.
  • Modified precision: It calculates the precision of these matches, i.e. the fraction of n-grams in the machine output that also appear in a reference, with each n-gram's count clipped to its maximum count in any single reference so that repeating a matching word cannot inflate the score.
  • Corpus-level calculation: BLEU pools the clipped n-gram counts over an entire corpus of sentences before computing precision, which gives a more stable score than evaluating sentences individually.
  • Brevity penalty: A penalty is applied to machine translations that are too short compared to the reference translations, ensuring that a high score isn't achieved by being overly brief.
  • Score range: The per-order precisions (typically up to 4-grams) are combined as a geometric mean and multiplied by the brevity penalty, yielding a final score between 0 and 1 (or 0% and 100%); a minimal from-scratch sketch of the whole computation follows this list.
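
The steps above can be sketched in a few lines of Python. This is a minimal, illustrative implementation of the usual formulation BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights w_n = 1/N; the helper names are ours, not NLTK's, and real evaluations should use an established library such as NLTK or sacreBLEU.

import math
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the most generous single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one zero precision collapses the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(candidate, references) * geo_mean


candidate = "the cat sat on the mat".split()
references = ["a cat is on the mat".split(), "the cat is on the mat".split()]
print(f"BLEU-2: {bleu(candidate, references, max_n=2):.3f}")  # bigrams only
print(f"BLEU-4: {bleu(candidate, references, max_n=4):.3f}")  # 0.000: no 4-gram match

At the sentence level a single missing 4-gram can zero out the whole score, which is why the NLTK example below passes a smoothing function.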

Example

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate_translation = "the cat sat on the mat"
reference_translations = [
    "a cat is on the mat",
    "the cat is on the mat"
]

# Convert strings to lists of words (BLEU operates on token lists)
candidate = candidate_translation.split()
references = [ref.split() for ref in reference_translations]

# The candidate shares no 4-gram with either reference, so the default
# (unsmoothed) BLEU-4 collapses to ~0 and NLTK emits a warning.
# A smoothing function gives a more informative sentence-level score.
smoothing = SmoothingFunction().method1

score = sentence_bleu(references, candidate, smoothing_function=smoothing)
print(f"BLEU score with multiple references: {score:.3f}")