COMET Score

COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a neural framework for training multilingual machine translation evaluation models that can be adapted for summarization evaluation.

Overview

COMET leverages contextual embeddings and a neural regression model trained on human judgments to evaluate text quality. Unlike traditional metrics that rely on lexical overlap, COMET captures semantic similarity and quality aspects that correlate strongly with human evaluations.

How It Works

COMET operates by:

  1. Encoding the source text, candidate summary, and optionally a reference summary with a pre-trained multilingual language model (typically XLM-R)
  2. Passing these encodings through a neural regression or quality estimation model
  3. Producing a quality score that predicts human judgment

The framework supports both reference-based evaluation (using source, reference, and candidate) and reference-free quality estimation (using only source and candidate).
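
To make the two modes concrete, here is a minimal sketch using the unbabel-comet package directly (pip install unbabel-comet). The checkpoint name Unbabel/wmt22-comet-da and the exact shape of the prediction object are assumptions that vary across package versions, so treat this as orientation rather than a definitive recipe:

from comet import download_model, load_from_checkpoint

# Reference-based model: scores a candidate against both source and reference.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "The quarterly report shows revenue grew 12 percent.",  # source text
        "mt": "Revenue rose 12% last quarter.",                        # candidate summary
        "ref": "Revenue increased by 12 percent in the quarter.",      # human reference
    }
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # one quality score per input
print(output.system_score)  # average over all inputs

# For reference-free quality estimation, load a QE checkpoint instead
# and omit the "ref" field from each input.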

COMET relies on pre-trained models; the default wmt20-comet-da checkpoint (~1.8 GB) is downloaded automatically on first use.

Usage Example

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Request both the reference-based and reference-free COMET metrics.
results = evaluate_summary(
    full_text,
    summary,
    metrics=["comet_score", "comet_qe_score"],
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

  • Scores typically fall between 0 and 1, though some COMET models produce unbounded regression outputs that can exceed this range
  • Higher scores indicate better alignment with human judgments (see the ranking sketch after this list)
  • comet_score uses a reference-based approach
  • comet_qe_score uses a reference-free quality estimation approach
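
As a quick illustration of "higher is better", the sketch below ranks candidate summaries by their reference-free score. It assumes evaluate_summary accepts the same arguments as in the usage example above:

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
candidates = ["first candidate summary", "second candidate summary"]

# Score each candidate against the source alone (no reference needed).
scored = []
for cand in candidates:
    results = evaluate_summary(full_text, cand, metrics=["comet_qe_score"])
    scored.append((results["comet_qe_score"], cand))

# The highest-scoring candidate is the best by predicted quality.
best_score, best_summary = max(scored)
print(f"Best candidate ({best_score:.4f}): {best_summary}")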

Variants

  • COMET-DA: Direct assessment model that predicts absolute quality scores
  • COMET-HTER: Predicts human translation error rate
  • COMET-QE: Reference-free quality estimation model (example checkpoint names for each variant are sketched below)
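
For orientation, a mapping from these variants to example checkpoint names. Apart from wmt20-comet-da, which is mentioned above, these names are assumptions drawn from older COMET releases, so verify them against the model list shipped with the comet package:

# Variant-to-checkpoint mapping; names other than wmt20-comet-da are assumptions.
COMET_VARIANT_MODELS = {
    "COMET-DA": "wmt20-comet-da",              # direct assessment (default above)
    "COMET-QE": "wmt20-comet-qe-da",           # assumed reference-free QE checkpoint
    "COMET-HTER": "wmt-large-hter-estimator",  # assumed HTER model from early releases
}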

Benefits for Summarization

COMET is particularly valuable for summarization evaluation because:

  1. It captures semantic equivalence beyond lexical overlap
  2. It's been trained on human judgments, making it more aligned with human perception of quality
  3. It can evaluate aspects like fluency, coherence, and faithfulness simultaneously
  4. It works well for evaluating abstractive summaries, where the wording differs significantly from the source

Fine-tuning for Summarization

While COMET was originally designed for machine translation, it can be fine-tuned for summarization evaluation on human judgments of summary quality, further improving its correlation with human assessments.
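
A rough sketch of preparing such fine-tuning data. The CSV column layout (src, mt, ref, score) matches what COMET's training tooling has historically expected, but both the format and the comet-train invocation should be verified against the current COMET repository:

import csv

# Each row pairs a (source, candidate summary, reference summary) triple
# with a human quality score to regress against.
rows = [
    {
        "src": "full source document text",
        "mt": "candidate summary text",
        "ref": "reference summary text",
        "score": 0.85,  # human judgment, e.g. normalized to [0, 1]
    },
]

with open("summarization_judgments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["src", "mt", "ref", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Training is then driven by a YAML config, along the lines of:
#   comet-train --cfg configs/models/regression_model.yaml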