ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation.
Overview
ROUGE measures the quality of a summary by comparing it to reference summaries created by humans. It counts the number of overlapping units such as n-grams, word sequences, and word pairs between the generated summary and the reference summaries.
Types of ROUGE
We implement three main ROUGE variants:
- ROUGE-1: Measures unigram (single word) overlap
- ROUGE-2: Measures bigram (two consecutive words) overlap
- ROUGE-L: Measures Longest Common Subsequence (LCS) between texts
For each variant, we calculate the following scores (a worked example follows this list):
- Precision: Proportion of n-grams in the generated summary that appear in the reference
- Recall: Proportion of n-grams in the reference that appear in the generated summary
- F1-score: Harmonic mean of precision and recall
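To make these definitions concrete, here is a small standalone sketch that computes ROUGE-1 and ROUGE-2 by hand for a toy sentence pair. It uses plain Python counting and skips details such as stemming and stopword handling, so it illustrates the idea rather than reproducing the library's internal implementation; the helper names (get_ngrams, rouge_n) and the sentences are invented for the example.

from collections import Counter

def get_ngrams(tokens, n):
    # All contiguous n-grams in a token list, with counts
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    # Clipped n-gram overlap between candidate and reference,
    # then precision, recall, and their harmonic mean (F1)
    cand_ngrams = get_ngrams(candidate.lower().split(), n)
    ref_ngrams = get_ngrams(reference.lower().split(), n)
    overlap = sum((cand_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(cand_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = "The cat sat on the mat"
candidate = "The cat lay on the mat"

print(rouge_n(candidate, reference, 1))  # 5 of 6 unigrams match: P = R = F1 ≈ 0.83
print(rouge_n(candidate, reference, 2))  # 3 of 5 bigrams match: P = R = F1 = 0.60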
Usage Example
from assert_llm_tools.core import evaluate_summary

# Source document and candidate summary (placeholders; substitute real text)
full_text = "full text"
summary = "summary text"

# Request only the ROUGE metrics
results = evaluate_summary(
    full_text,
    summary,
    metrics=["rouge"],
)

print("\nEvaluation Metrics:")

# ROUGE-1 scores
print(f"ROUGE-1 Precision: {results['rouge1_precision']:.4f}")
print(f"ROUGE-1 Recall: {results['rouge1_recall']:.4f}")
print(f"ROUGE-1 F1: {results['rouge1_f1']:.4f}")

# ROUGE-2 scores
print(f"ROUGE-2 Precision: {results['rouge2_precision']:.4f}")
print(f"ROUGE-2 Recall: {results['rouge2_recall']:.4f}")
print(f"ROUGE-2 F1: {results['rouge2_f1']:.4f}")

# ROUGE-L scores
print(f"ROUGE-L Precision: {results['rougeL_precision']:.4f}")
print(f"ROUGE-L Recall: {results['rougeL_recall']:.4f}")
print(f"ROUGE-L F1: {results['rougeL_f1']:.4f}")
Interpretation
Each ROUGE variant provides different insights:
ROUGE-1:
- Good for assessing content coverage
- Insensitive to word order (it only counts matching words)
- Scores range from 0 to 1
ROUGE-2:
- Better at capturing fluency and readability
- More sensitive to word order
- Typically lower than ROUGE-1 scores
ROUGE-L:
- Rewards the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously
- More flexible than fixed n-gram matching (see the comparison sketch at the end of this section)
- Typically falls between ROUGE-1 and ROUGE-2 in strictness
For all metrics:
- Higher scores indicate better alignment with reference text
- Precision, recall, and F1 provide different perspectives on quality
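To see how the variants respond differently to word order, here is another standalone sketch (again an illustration, not the library's implementation) that scores a candidate built from the reference's own words in a different order. The ngram_overlap_f1 and lcs_length helpers and the sentences are invented for the example, and the ROUGE-L value uses a plain F1 rather than the weighted F-measure some implementations apply.

from collections import Counter

def ngram_overlap_f1(cand_tokens, ref_tokens, n):
    # Clipped n-gram overlap -> F1, as in ROUGE-N
    cand = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum((cand & ref).values())
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r) if p + r else 0.0

def lcs_length(a, b):
    # Standard dynamic-programming longest common subsequence over tokens
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "police killed the gunman".split()
candidate = "the gunman killed police".split()

lcs = lcs_length(candidate, reference)                    # longest in-order match: "the gunman" -> 2
rouge_l_f1 = 2 * lcs / (len(candidate) + len(reference))  # F1 from LCS simplifies to 2*LCS / (|cand| + |ref|)

print(ngram_overlap_f1(candidate, reference, 1))  # 1.00 - same bag of words
print(ngram_overlap_f1(candidate, reference, 2))  # 0.33 - word order is scrambled
print(rouge_l_f1)                                 # 0.50 - only "the gunman" survives in sequence

Because both texts use exactly the same words, ROUGE-1 cannot tell that the candidate reverses the meaning, ROUGE-2 penalizes it heavily, and ROUGE-L lands in between, which is the pattern described above.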
Limitations
- Focuses on lexical overlap rather than semantic meaning
- May miss semantically correct paraphrases (a short example follows this list)
- Requires human-written reference summaries
- Different variants may give conflicting signals
- No single "best" metric - consider multiple scores together
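As a concrete illustration of the first two limitations, the sketch below compares a reference with a paraphrase that keeps the meaning but reuses almost none of its vocabulary; the sentences are invented for the example.

from collections import Counter

reference = "the company's profits rose sharply last quarter".split()
candidate = "earnings increased a great deal in the most recent period".split()

# Clipped unigram overlap, as used by ROUGE-1
overlap = sum((Counter(candidate) & Counter(reference)).values())
print(overlap, len(candidate), len(reference))  # 1 10 7
# Only "the" is shared, so ROUGE-1 precision is 1/10 and recall is 1/7,
# even though the two sentences say roughly the same thing.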