COMET Score
COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a neural framework for training multilingual machine translation evaluation models that can be adapted for summarization evaluation.
Overview
COMET leverages contextual embeddings and a neural regression model trained on human judgments to evaluate text quality. Unlike traditional metrics that rely on lexical overlap, COMET captures semantic similarity and quality aspects that correlate strongly with human evaluations.
How It Works
COMET operates by:
- Encoding the source text, candidate summary, and optionally a reference summary using a pre-trained language model
- Passing these encodings through a neural regression or quality estimation model
- Producing a quality score that predicts human judgment
The framework supports both reference-based evaluation (using source, reference, and candidate) and reference-free quality estimation (using only source and candidate).
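The two modes above can be sketched with a toy stand-in. Everything below is illustrative, not COMET's actual architecture: a hypothetical character-level `encode` function replaces the pre-trained language model, and a fixed similarity-combining rule replaces the learned regression head.

```python
import math

def encode(text):
    """Stand-in for a sentence encoder: a tiny bag-of-characters vector."""
    vec = [0.0] * 4
    for i, ch in enumerate(text.lower()):
        vec[i % 4] += ord(ch) / 1000.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def comet_like_score(source, candidate, reference=None):
    """Combine pairwise embedding similarities into one quality score.
    Reference-based mode uses (src, ref, cand); QE mode uses (src, cand)."""
    src, cand = encode(source), encode(candidate)
    if reference is None:
        raw = cosine(src, cand)                    # reference-free (QE)
    else:
        ref = encode(reference)
        raw = 0.5 * cosine(cand, ref) + 0.5 * cosine(cand, src)
    return 1 / (1 + math.exp(-4 * (raw - 0.5)))    # squash into (0, 1)
```

In the real framework, the combination weights are learned by regressing on human judgments rather than being hand-set as here.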
The implementation relies on pre-trained COMET models; the default checkpoint (wmt20-comet-da, roughly 1.8 GB) is downloaded on first use.
Usage Example
```python
from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

results = evaluate_summary(
    full_text,
    summary,
    metrics=["comet_score", "comet_qe_score"],
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")
```
Interpretation
- Scores typically range from 0 to 1
- Higher scores indicate better alignment with human judgments
- comet_score uses a reference-based approach (source, reference, and candidate)
- comet_qe_score uses a reference-free quality estimation approach (source and candidate only)
Variants
- COMET-DA: Direct assessment model that predicts absolute quality scores
- COMET-HTER: Predicts Human-targeted Translation Edit Rate (HTER)
- COMET-QE: Reference-free quality estimation model
Benefits for Summarization
COMET is particularly valuable for summarization evaluation because:
- It captures semantic equivalence beyond lexical overlap
- It's been trained on human judgments, making it more aligned with human perception of quality
- It can evaluate aspects like fluency, coherence, and faithfulness simultaneously
- It works well for evaluating abstractive summaries where wording differs significantly from source
Fine-tuning for Summarization
While COMET was originally designed for machine translation, it can be fine-tuned specifically for summarization evaluation using human judgments of summary quality, further improving its correlation with human evaluations.
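Fine-tuning ultimately means refitting the regression head on new (feature, human score) pairs. As a minimal sketch, a one-feature linear model trained by gradient descent stands in for that final neural regressor; the data points are invented toy examples, not a real judgments dataset:

```python
# Toy data: (encoder similarity feature, human quality judgment in [0, 1]).
data = [(0.9, 0.95), (0.7, 0.8), (0.5, 0.4), (0.2, 0.1)]

w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):                       # plain gradient descent on MSE
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y               # prediction error on one example
        grad_w += 2 * err * x / len(data)
        grad_b += 2 * err / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

def predict(similarity):
    """Map an encoder-similarity feature to a predicted quality score."""
    return w * similarity + b
```

After fitting, higher similarity features map to higher predicted quality, mirroring how a fine-tuned regressor tracks the human judgments it was trained on.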