BLEU Score
BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. While originally designed for translation tasks, it's also useful for evaluating text generation and summarization.
Overview
BLEU works by comparing n-gram matches between the generated text and one or more reference texts. It combines precision scores for different n-gram lengths (typically 1-4) and includes a brevity penalty to discourage very short outputs.
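For example, given the candidate "the cat sat on the mat" and the reference "the cat is on the mat", five of the six candidate unigrams appear in the reference (unigram precision 5/6 ≈ 0.83), but only three of the five candidate bigrams do (bigram precision 3/5 = 0.60). BLEU combines these precisions across n-gram orders into a single score.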
How It Works
BLEU calculates scores using these components:
- N-gram precision (usually for n=1,2,3,4)
- Brevity penalty to penalize short outputs
- A weighted geometric mean of the n-gram precisions
The final BLEU score is computed as:
BLEU = BP × exp(Σ wₙ × log pₙ)
Where:
- BP is the brevity penalty
- wₙ are the weights for each n-gram order (typically uniform, i.e. wₙ = 1/N)
- pₙ are the modified (clipped) n-gram precisions
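As a concrete illustration of this formula, the sketch below computes an unsmoothed BLEU score for a single candidate against a single reference, using clipped n-gram counts, uniform weights, and the standard brevity penalty. It is a minimal illustration of the math above, not the implementation used by assert_llm_tools.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped precision: each candidate n-gram counts only up to its frequency in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0:
            return 0.0  # without smoothing, any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 when the candidate is at least as long as the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Uniform weights wₙ = 1/max_n give the geometric mean of the precisions
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the quick brown fox jumps over the lazy dog",
           "the quick brown fox jumped over the lazy dog"))  # roughly 0.6 for this near-identical pair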
Usage Example (Python)
from assert_llm_tools.core import evaluate_summary

metrics = ["bleu"]
full_text = "full text"
summary = "summary text"

# Evaluate the summary against the source text and collect the scores
results = evaluate_summary(
    full_text,
    summary,
    metrics=metrics,
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")
Interpretation
- Scores range from 0 to 1 (often reported as 0-100)
- Higher scores indicate better alignment with reference text
- Typical scores vary by task:
  - High-quality human translation: 0.5-0.8
  - Machine translation: 0.2-0.5
  - Text generation: varies widely by task
Advantages
- Well-established metric with years of use
- Language-independent
- Correlates reasonably well with human judgments
- Fast and easy to compute
Limitations
- Exact Match Requirement: Only considers exact word matches, missing semantic similarities (illustrated in the sketch after this list)
- Reference Dependency: Requires one or more reference texts
- Word Order Sensitivity: Limited ability to handle valid reorderings
- Length Bias: Can be biased against longer or shorter outputs despite the brevity penalty
- Task Specificity: Originally designed for translation, may not be ideal for other tasks
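The exact-match limitation is easy to see with NLTK's reference BLEU implementation (assuming the nltk package is installed): a paraphrase that shares few exact words with the reference scores very low, while a verbatim copy scores 1.0.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the weather is cold today".split()
paraphrase = "it is freezing outside today".split()  # similar meaning, few shared words
copy = "the weather is cold today".split()           # verbatim copy of the reference

smooth = SmoothingFunction().method1  # avoids zero precisions on short sentences
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low despite similar meaning
print(sentence_bleu([reference], copy, smoothing_function=smooth))        # 1.0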
Best Practices
- Use multiple reference texts when possible
- Consider custom n-gram weights based on your task (both practices are shown in the sketch after this list)
- Use BLEU alongside other metrics for a more complete evaluation
- Be cautious when comparing BLEU scores across different systems or datasets
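As a sketch of the first two practices, the snippet below (again assuming NLTK is installed) scores one candidate against several references, first with the standard uniform BLEU-4 weights and then with custom weights that emphasize lower-order n-grams.

from nltk.translate.bleu_score import sentence_bleu

candidate = "the cat is on the mat".split()
references = [
    "the cat sat on the mat".split(),
    "a cat is on the mat".split(),
]

# Standard BLEU-4: uniform weights over 1- to 4-grams
print(sentence_bleu(references, candidate))

# Custom weights that put more emphasis on unigram and bigram overlap
print(sentence_bleu(references, candidate, weights=(0.4, 0.3, 0.2, 0.1)))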