ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation.
Overview
ROUGE measures the quality of a summary by comparing it to reference summaries created by humans. It counts the number of overlapping units such as n-grams, word sequences, and word pairs between the generated summary and the reference summaries.
Types of ROUGE
We implement three main ROUGE variants:
- ROUGE-1: Measures unigram (single word) overlap
- ROUGE-2: Measures bigram (two consecutive words) overlap
- ROUGE-L: Measures Longest Common Subsequence (LCS) between texts
For each variant, we calculate the following scores (a worked example follows this list):
- Precision: Proportion of n-grams in the generated summary that appear in the reference
- Recall: Proportion of n-grams in the reference that appear in the generated summary
- F1-score: Harmonic mean of precision and recall
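To make these definitions concrete, here is a small standalone sketch that computes ROUGE-1 and ROUGE-2 by hand for a toy sentence pair. It uses plain Python counting and skips details such as stemming and stopword handling, so it illustrates the idea rather than reproducing the library's internal implementation; the helper names (get_ngrams, rouge_n) and the sentences are invented for the example.

from collections import Counter

def get_ngrams(tokens, n):
    # All contiguous n-grams in a token list, with counts
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    # Clipped n-gram overlap between candidate and reference,
    # then precision, recall, and their harmonic mean (F1)
    cand_ngrams = get_ngrams(candidate.lower().split(), n)
    ref_ngrams = get_ngrams(reference.lower().split(), n)
    overlap = sum((cand_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(cand_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = "The cat sat on the mat"
candidate = "The cat lay on the mat"

print(rouge_n(candidate, reference, 1))  # 5 of 6 unigrams match: P = R = F1 ≈ 0.83
print(rouge_n(candidate, reference, 2))  # 3 of 5 bigrams match: P = R = F1 = 0.60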
Usage Example
from assert_llm_tools.core import evaluate_summary

# Source document and candidate summary (placeholders; substitute real text)
full_text = "full text"
summary = "summary text"

# Request only the ROUGE metrics
results = evaluate_summary(
    full_text,
    summary,
    metrics=["rouge"],
)

print("\nEvaluation Metrics:")

# ROUGE-1 scores
print(f"ROUGE-1 Precision: {results['rouge1_precision']:.4f}")
print(f"ROUGE-1 Recall: {results['rouge1_recall']:.4f}")
print(f"ROUGE-1 F1: {results['rouge1_f1']:.4f}")

# ROUGE-2 scores
print(f"ROUGE-2 Precision: {results['rouge2_precision']:.4f}")
print(f"ROUGE-2 Recall: {results['rouge2_recall']:.4f}")
print(f"ROUGE-2 F1: {results['rouge2_f1']:.4f}")

# ROUGE-L scores
print(f"ROUGE-L Precision: {results['rougeL_precision']:.4f}")
print(f"ROUGE-L Recall: {results['rougeL_recall']:.4f}")
print(f"ROUGE-L F1: {results['rougeL_f1']:.4f}")
Interpretation
Each ROUGE variant provides different insights:
ROUGE-1:
- Good for assessing content coverage
- Insensitive to word order (it only counts matching words)
- Scores range from 0 to 1
ROUGE-2:
- Better at capturing fluency and readability
- More sensitive to word order
- Typically lower than ROUGE-1 scores
ROUGE-L:
- Rewards the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously
- More flexible than fixed n-gram matching (see the comparison sketch at the end of this section)
- Typically falls between ROUGE-1 and ROUGE-2 in strictness
For all metrics:
- Higher scores indicate better alignment with reference text
- Precision, recall, and F1 provide different perspectives on quality
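To see how the variants respond differently to word order, here is another standalone sketch (again an illustration, not the library's implementation) that scores a candidate built from the reference's own words in a different order. The ngram_overlap_f1 and lcs_length helpers and the sentences are invented for the example, and the ROUGE-L value uses a plain F1 rather than the weighted F-measure some implementations apply.

from collections import Counter

def ngram_overlap_f1(cand_tokens, ref_tokens, n):
    # Clipped n-gram overlap -> F1, as in ROUGE-N
    cand = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum((cand & ref).values())
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r) if p + r else 0.0

def lcs_length(a, b):
    # Standard dynamic-programming longest common subsequence over tokens
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "police killed the gunman".split()
candidate = "the gunman killed police".split()

lcs = lcs_length(candidate, reference)                    # longest in-order match: "the gunman" -> 2
rouge_l_f1 = 2 * lcs / (len(candidate) + len(reference))  # F1 from LCS simplifies to 2*LCS / (|cand| + |ref|)

print(ngram_overlap_f1(candidate, reference, 1))  # 1.00 - same bag of words
print(ngram_overlap_f1(candidate, reference, 2))  # 0.33 - word order is scrambled
print(rouge_l_f1)                                 # 0.50 - only "the gunman" survives in sequence

Because both texts use exactly the same words, ROUGE-1 cannot tell that the candidate reverses the meaning, ROUGE-2 penalizes it heavily, and ROUGE-L lands in between, which is the pattern described above.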
Limitations
- Focuses on lexical overlap rather than semantic meaning
- May miss semantically correct paraphrases (a short example follows this list)
- Requires human-written reference summaries
- Different variants may give conflicting signals
- No single "best" metric - consider multiple scores together
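As a concrete illustration of the first two limitations, the sketch below compares a reference with a paraphrase that keeps the meaning but reuses almost none of its vocabulary; the sentences are invented for the example.

from collections import Counter

reference = "the company's profits rose sharply last quarter".split()
candidate = "earnings increased a great deal in the most recent period".split()

# Clipped unigram overlap, as used by ROUGE-1
overlap = sum((Counter(candidate) & Counter(reference)).values())
print(overlap, len(candidate), len(reference))  # 1 10 7
# Only "the" is shared, so ROUGE-1 precision is 1/10 and recall is 1/7,
# even though the two sentences say roughly the same thing.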