
BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. While originally designed for translation tasks, it's also useful for evaluating text generation and summarization.

Overview

BLEU works by comparing n-gram matches between the generated text and one or more reference texts. It combines precision scores for different n-gram lengths (typically 1-4) and includes a brevity penalty to discourage very short outputs.
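To make the n-gram matching concrete, the sketch below computes clipped ("modified") n-gram precision by hand. It is only an illustration of the idea; the function names and example sentences are invented and are not part of assert_llm_tools.

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams in a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision: candidate counts are capped by reference counts
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(modified_precision(candidate, reference, 1))  # unigram precision: 5/6
print(modified_precision(candidate, reference, 2))  # bigram precision: 3/5

BLEU computes one such precision per n-gram order and then combines them, as described in the next section.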

How It Works

BLEU calculates scores using these components:

  1. Modified n-gram precision (usually for n = 1, 2, 3, 4)
  2. Brevity penalty to penalize outputs shorter than the reference
  3. Weighted geometric mean of the n-gram precisions

The final BLEU score is computed as:

BLEU = BP × exp(Σ wₙ × log pₙ)

Where:

  • BP is the brevity penalty: 1 if the candidate is at least as long as the reference, otherwise exp(1 − r/c), where r and c are the reference and candidate lengths
  • wₙ are the weights for each n-gram order (typically uniform, e.g. 0.25 each for n = 1–4)
  • pₙ are the modified (clipped) n-gram precisions
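
As a minimal sketch of how these pieces combine, the snippet below plugs made-up precision values and lengths into the formula; the numbers are illustrative only and are not output from any real system.

import math

# Suppose these are the clipped n-gram precisions p_1 .. p_4 for a candidate
# (illustrative numbers only)
precisions = [0.75, 0.50, 0.40, 0.30]
weights = [0.25, 0.25, 0.25, 0.25]   # uniform w_n

candidate_len = 18
reference_len = 20

# Brevity penalty: 1 if the candidate is at least as long as the reference,
# otherwise exp(1 - r/c)
if candidate_len >= reference_len:
    bp = 1.0
else:
    bp = math.exp(1 - reference_len / candidate_len)

# BLEU = BP * exp(sum of w_n * log p_n)
bleu = bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
print(f"BP = {bp:.4f}, BLEU = {bleu:.4f}")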

Usage Example

from assert_llm_tools.core import evaluate_summary

# Request the BLEU metric
metrics = ["bleu"]

full_text = "full text"
summary = "summary text"

# evaluate_summary returns a mapping of metric names to scores
results = evaluate_summary(
    full_text,
    summary,
    metrics=metrics,
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

  • Scores range from 0 to 1 (often reported as 0-100)
  • Higher scores indicate better alignment with reference text
  • Typical scores vary by task:
    • High-quality human translation: 0.5-0.8
    • Machine translation: 0.2-0.5
    • Text generation: varies widely by task

Advantages

  • Well-established metric, in wide use since 2002
  • Language-independent
  • Correlates reasonably well with human judgments
  • Fast and easy to compute

Limitations

  1. Exact Match Requirement: Only exact word matches count, so semantically equivalent paraphrases can score poorly (illustrated in the sketch after this list)
  2. Reference Dependency: Requires one or more reference texts
  3. Word Order Sensitivity: Limited ability to handle valid reorderings
  4. Length Bias: Can be biased against longer or shorter outputs despite the brevity penalty
  5. Task Specificity: Originally designed for translation, may not be ideal for other tasks
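
The exact-match limitation is easy to see with a small example. The sketch below uses NLTK's sentence_bleu purely for illustration (NLTK is a separate library, not part of assert_llm_tools), and the sentences are invented: a paraphrase that preserves the meaning still scores very low because few n-grams match exactly.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "results", "were", "very", "encouraging"]]
paraphrase = ["the", "findings", "were", "highly", "promising"]

# Smoothing avoids log(0) when higher-order n-grams have no matches
score = sentence_bleu(reference, paraphrase, smoothing_function=SmoothingFunction().method1)
print(f"BLEU for a close paraphrase: {score:.4f}")  # low, despite near-identical meaning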

Best Practices

  1. Use multiple reference texts when possible
  2. Consider custom n-gram weights based on your task (see the sketch after this list)
  3. Use BLEU alongside other metrics for a more complete evaluation
  4. Be cautious when comparing BLEU scores across different systems or datasets
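
As a rough illustration of points 1 and 2, the sketch below again uses NLTK's sentence_bleu (a separate library, shown only as an example implementation) with two reference texts and weights restricted to unigrams and bigrams; the data is invented.

from nltk.translate.bleu_score import sentence_bleu

# Two tokenized references; an n-gram in the candidate can match either one
references = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# Custom weights: only unigrams and bigrams, weighted equally (BLEU-2)
score = sentence_bleu(references, candidate, weights=(0.5, 0.5))
print(f"BLEU-2 with two references: {score:.4f}")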