COMET Score

COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a neural framework for training multilingual machine translation evaluation models that can be adapted for summarization evaluation.

Overview

COMET leverages contextual embeddings and a neural regression model trained on human judgments to evaluate text quality. Unlike traditional metrics that rely on lexical overlap, COMET captures semantic similarity and quality aspects that correlate strongly with human evaluations.

How It Works

COMET operates by:

  1. Encoding the source text, candidate summary, and optionally a reference summary with a pre-trained multilingual language model (typically XLM-R)
  2. Passing these encodings through a neural regression or quality estimation model
  3. Producing a quality score that predicts human judgment

The framework supports both reference-based evaluation (using source, reference, and candidate) and reference-free quality estimation (using only source and candidate).
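
To make the two modes concrete, here is a minimal sketch using the unbabel-comet package directly (pip install unbabel-comet). The checkpoint name Unbabel/wmt22-comet-da and the exact shape of the prediction object are assumptions that vary across package versions, so treat this as orientation rather than a definitive recipe:

from comet import download_model, load_from_checkpoint

# Reference-based model: scores a candidate against both source and reference.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "The quarterly report shows revenue grew 12 percent.",  # source text
        "mt": "Revenue rose 12% last quarter.",                        # candidate summary
        "ref": "Revenue increased by 12 percent in the quarter.",      # human reference
    }
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # one quality score per input
print(output.system_score)  # average over all inputs

# For reference-free quality estimation, load a QE checkpoint instead
# and omit the "ref" field from each input.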

COMET relies on pre-trained models; the default wmt20-comet-da checkpoint (~1.8 GB) is downloaded automatically on first use.

Usage Example

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Request both the reference-based and reference-free COMET metrics.
results = evaluate_summary(
    full_text,
    summary,
    metrics=["comet_score", "comet_qe_score"],
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

  • Scores typically fall between 0 and 1, though some COMET models produce unbounded regression outputs that can exceed this range
  • Higher scores indicate better alignment with human judgments (see the ranking sketch after this list)
  • comet_score uses a reference-based approach
  • comet_qe_score uses a reference-free quality estimation approach
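
As a quick illustration of "higher is better", the sketch below ranks candidate summaries by their reference-free score. It assumes evaluate_summary accepts the same arguments as in the usage example above:

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
candidates = ["first candidate summary", "second candidate summary"]

# Score each candidate against the source alone (no reference needed).
scored = []
for cand in candidates:
    results = evaluate_summary(full_text, cand, metrics=["comet_qe_score"])
    scored.append((results["comet_qe_score"], cand))

# The highest-scoring candidate is the best by predicted quality.
best_score, best_summary = max(scored)
print(f"Best candidate ({best_score:.4f}): {best_summary}")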

Variants

  • COMET-DA: Direct assessment model that predicts absolute quality scores
  • COMET-HTER: Predicts human translation error rate
  • COMET-QE: Reference-free quality estimation model (example checkpoint names for each variant are sketched below)
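
For orientation, a mapping from these variants to example checkpoint names. Apart from wmt20-comet-da, which is mentioned above, these names are assumptions drawn from older COMET releases, so verify them against the model list shipped with the comet package:

# Variant-to-checkpoint mapping; names other than wmt20-comet-da are assumptions.
COMET_VARIANT_MODELS = {
    "COMET-DA": "wmt20-comet-da",              # direct assessment (default above)
    "COMET-QE": "wmt20-comet-qe-da",           # assumed reference-free QE checkpoint
    "COMET-HTER": "wmt-large-hter-estimator",  # assumed HTER model from early releases
}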

Benefits for Summarization

COMET is particularly valuable for summarization evaluation because:

  1. It captures semantic equivalence beyond lexical overlap
  2. It's been trained on human judgments, making it more aligned with human perception of quality
  3. It can evaluate aspects like fluency, coherence, and faithfulness simultaneously
  4. It works well for evaluating abstractive summaries, where the wording differs significantly from the source

Fine-tuning for Summarization

While COMET was originally designed for machine translation, it can be fine-tuned for summarization evaluation on human judgments of summary quality, further improving its correlation with human assessments.
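
A rough sketch of preparing such fine-tuning data. The CSV column layout (src, mt, ref, score) matches what COMET's training tooling has historically expected, but both the format and the comet-train invocation should be verified against the current COMET repository:

import csv

# Each row pairs a (source, candidate summary, reference summary) triple
# with a human quality score to regress against.
rows = [
    {
        "src": "full source document text",
        "mt": "candidate summary text",
        "ref": "reference summary text",
        "score": 0.85,  # human judgment, e.g. normalized to [0, 1]
    },
]

with open("summarization_judgments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["src", "mt", "ref", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Training is then driven by a YAML config, along the lines of:
#   comet-train --cfg configs/models/regression_model.yaml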