ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation.

Overview

ROUGE measures the quality of a summary by comparing it to reference summaries created by humans. It counts the number of overlapping units such as n-grams, word sequences, and word pairs between the generated summary and the reference summaries.

Types of ROUGE

We implement three main ROUGE variants:

  • ROUGE-1: Measures unigram (single word) overlap
  • ROUGE-2: Measures bigram (two consecutive words) overlap
  • ROUGE-L: Measures Longest Common Subsequence (LCS) between texts

For each variant, we calculate:

  • Precision: Proportion of n-grams in the generated summary that appear in the reference
  • Recall: Proportion of n-grams in the reference that appear in the generated summary
  • F1-score: Harmonic mean of precision and recall
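
To make these definitions concrete, here is a minimal from-scratch sketch of ROUGE-1, ROUGE-2, and ROUGE-L using clipped n-gram counts and the longest common subsequence. It is for illustration only and is not the library's implementation, which may apply additional preprocessing (tokenization, stemming, or stopword handling) that changes the exact numbers.

from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams (as tuples) in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap, candidate_total, reference_total):
    # Precision, recall, and F1 from an overlap count and the two totals
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rouge_n(candidate, reference, n):
    # ROUGE-N: clipped n-gram overlap between candidate and reference
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    # Length of the longest common subsequence of two token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    # ROUGE-L: precision/recall over the longest common subsequence
    cand, ref = candidate.split(), reference.split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(candidate, reference, 1))  # ROUGE-1: (0.83, 0.83, 0.83)
print(rouge_n(candidate, reference, 2))  # ROUGE-2: (0.60, 0.60, 0.60)
print(rouge_l(candidate, reference))     # ROUGE-L: (0.83, 0.83, 0.83)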

Usage Example

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.utils import add_custom_stopwords  # optional helper, not used in this example

full_text = "full text"      # replace with the source document
summary = "summary text"     # replace with the candidate summary

# Request only the ROUGE metrics; evaluate_summary returns a flat dictionary of scores
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["rouge"],
)

print("\nEvaluation Metrics:")
# ROUGE-1 scores
print(f"ROUGE-1 Precision: {metrics['rouge1_precision']:.4f}")
print(f"ROUGE-1 Recall: {metrics['rouge1_recall']:.4f}")
print(f"ROUGE-1 F1: {metrics['rouge1_f1']:.4f}")

# ROUGE-2 scores
print(f"ROUGE-2 Precision: {metrics['rouge2_precision']:.4f}")
print(f"ROUGE-2 Recall: {metrics['rouge2_recall']:.4f}")
print(f"ROUGE-2 F1: {metrics['rouge2_f1']:.4f}")

# ROUGE-L scores
print(f"ROUGE-L Precision: {metrics['rougeL_precision']:.4f}")
print(f"ROUGE-L Recall: {metrics['rougeL_recall']:.4f}")
print(f"ROUGE-L F1: {metrics['rougeL_f1']:.4f}")

Interpretation

Each ROUGE variant provides different insights:

  • ROUGE-1:
    • Good for assessing content coverage
    • Less sensitive to word order
    • Scores range from 0 to 1
  • ROUGE-2:
    • Better at capturing fluency and readability
    • More sensitive to word order (see the example below)
    • Typically lower than ROUGE-1 scores
  • ROUGE-L:
    • Captures the longest in-sequence match
    • More flexible than fixed n-gram matching
    • A good balance between ROUGE-1 and ROUGE-2

For all metrics:

  • Higher scores indicate better alignment with reference text
  • Precision, recall, and F1 provide different perspectives on quality
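
To see why ROUGE-2 is more sensitive to word order than ROUGE-1, compare a summary against a word-shuffled copy of itself. The sketch below reuses evaluate_summary from the usage example; the source_text and candidate strings are illustrative placeholders, and the exact scores depend on the library's preprocessing, but ROUGE-2 should drop much more sharply than ROUGE-1 for the shuffled version.

from assert_llm_tools.core import evaluate_summary

source_text = "the quick brown fox jumps over the lazy dog"
candidates = {
    "fluent": "the quick brown fox jumps over the lazy dog",
    "shuffled": "dog lazy the over jumps fox brown quick the",
}

for label, candidate in candidates.items():
    scores = evaluate_summary(source_text, candidate, metrics=["rouge"])
    # The shuffled version preserves the unigrams, so ROUGE-1 stays roughly flat,
    # but it breaks almost every bigram, so ROUGE-2 collapses.
    print(f"{label:>8}  ROUGE-1 F1: {scores['rouge1_f1']:.2f}  ROUGE-2 F1: {scores['rouge2_f1']:.2f}")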

Limitations

  • Focuses on lexical overlap rather than semantic meaning
  • May miss semantically correct paraphrases (illustrated below)
  • Requires human-written reference summaries
  • Different variants may give conflicting signals
  • No single "best" metric; consider multiple scores together
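
The paraphrase limitation is easy to demonstrate. In the sketch below, the sentences are illustrative placeholders and the exact scores depend on the library's preprocessing, but a lexical near-copy will score high while a faithful paraphrase scores close to zero because it shares almost no surface tokens with the source.

from assert_llm_tools.core import evaluate_summary

source_text = "The cat sat on the mat."

for label, candidate in [
    ("near-copy", "The cat sat on a mat."),
    ("paraphrase", "A feline rested on the rug."),
]:
    scores = evaluate_summary(source_text, candidate, metrics=["rouge"])
    # The paraphrase conveys the same meaning but overlaps on few words,
    # so its ROUGE score is much lower than the near-copy's.
    print(f"{label:>10}  ROUGE-1 F1: {scores['rouge1_f1']:.2f}")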