
BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. While originally designed for translation tasks, it's also useful for evaluating text generation and summarization.

Overview

BLEU works by comparing n-gram matches between the generated text and one or more reference texts. It combines precision scores for different n-gram lengths (typically 1-4) and includes a brevity penalty to discourage very short outputs.
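To make the n-gram matching concrete, the sketch below computes clipped ("modified") n-gram precision by hand. It is only an illustration of the idea; the function names and example sentences are invented and are not part of assert_llm_tools.

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams in a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision: candidate counts are capped by reference counts
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(modified_precision(candidate, reference, 1))  # unigram precision: 5/6
print(modified_precision(candidate, reference, 2))  # bigram precision: 3/5

BLEU computes one such precision per n-gram order and then combines them, as described in the next section.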

How It Works

BLEU calculates scores using these components:

  1. Modified n-gram precision (usually for n = 1, 2, 3, 4)
  2. Brevity penalty to penalize outputs shorter than the reference
  3. Weighted geometric mean of the n-gram precisions

The final BLEU score is computed as:

BLEU = BP × exp(Σ wₙ × log pₙ)

Where:

  • BP is the brevity penalty: 1 if the candidate is at least as long as the reference, otherwise exp(1 − r/c), where r and c are the reference and candidate lengths
  • wₙ are the weights for each n-gram order (typically uniform, e.g. 0.25 each for n = 1–4)
  • pₙ are the modified (clipped) n-gram precisions
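
As a minimal sketch of how these pieces combine, the snippet below plugs made-up precision values and lengths into the formula; the numbers are illustrative only and are not output from any real system.

import math

# Suppose these are the clipped n-gram precisions p_1 .. p_4 for a candidate
# (illustrative numbers only)
precisions = [0.75, 0.50, 0.40, 0.30]
weights = [0.25, 0.25, 0.25, 0.25]   # uniform w_n

candidate_len = 18
reference_len = 20

# Brevity penalty: 1 if the candidate is at least as long as the reference,
# otherwise exp(1 - r/c)
if candidate_len >= reference_len:
    bp = 1.0
else:
    bp = math.exp(1 - reference_len / candidate_len)

# BLEU = BP * exp(sum of w_n * log p_n)
bleu = bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
print(f"BP = {bp:.4f}, BLEU = {bleu:.4f}")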

Usage Example

from assert_llm_tools.core import evaluate_summary

# Request the BLEU metric
metrics = ["bleu"]

full_text = "full text"
summary = "summary text"

# evaluate_summary returns a mapping of metric names to scores
results = evaluate_summary(
    full_text,
    summary,
    metrics=metrics,
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

  • Scores range from 0 to 1 (often reported as 0-100)
  • Higher scores indicate better alignment with reference text
  • Typical scores vary by task:
    • High-quality human translation: 0.5-0.8
    • Machine translation: 0.2-0.5
    • Text generation: varies widely by task

Advantages

  • Well-established metric, in wide use since 2002
  • Language-independent
  • Correlates reasonably well with human judgments
  • Fast and easy to compute

Limitations

  1. Exact Match Requirement: Only exact word matches count, so semantically equivalent paraphrases can score poorly (illustrated in the sketch after this list)
  2. Reference Dependency: Requires one or more reference texts
  3. Word Order Sensitivity: Limited ability to handle valid reorderings
  4. Length Bias: Can be biased against longer or shorter outputs despite the brevity penalty
  5. Task Specificity: Originally designed for translation, may not be ideal for other tasks
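
The exact-match limitation is easy to see with a small example. The sketch below uses NLTK's sentence_bleu purely for illustration (NLTK is a separate library, not part of assert_llm_tools), and the sentences are invented: a paraphrase that preserves the meaning still scores very low because few n-grams match exactly.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "results", "were", "very", "encouraging"]]
paraphrase = ["the", "findings", "were", "highly", "promising"]

# Smoothing avoids log(0) when higher-order n-grams have no matches
score = sentence_bleu(reference, paraphrase, smoothing_function=SmoothingFunction().method1)
print(f"BLEU for a close paraphrase: {score:.4f}")  # low, despite near-identical meaning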

Best Practices

  1. Use multiple reference texts when possible
  2. Consider custom n-gram weights based on your task (see the sketch after this list)
  3. Use BLEU alongside other metrics for a more complete evaluation
  4. Be cautious when comparing BLEU scores across different systems or datasets
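
As a rough illustration of points 1 and 2, the sketch below again uses NLTK's sentence_bleu (a separate library, shown only as an example implementation) with two reference texts and weights restricted to unigrams and bigrams; the data is invented.

from nltk.translate.bleu_score import sentence_bleu

# Two tokenized references; an n-gram in the candidate can match either one
references = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# Custom weights: only unigrams and bigrams, weighted equally (BLEU-2)
score = sentence_bleu(references, candidate, weights=(0.5, 0.5))
print(f"BLEU-2 with two references: {score:.4f}")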