Metrics

This section describes the metrics available for evaluating the quality and similarity of generated text. They fall into two main categories: summarization metrics and metrics for Retrieval-Augmented Generation (RAG).

Summarization Metrics

Traditional metrics that don't require an LLM provider (BERT Score, BART Score, and COMET still rely on pretrained models):

ROUGE Score (0-1): Measures n-gram overlap between the reference text and the generated summary
BLEU Score (0-1): Measures n-gram precision of the generated text against the reference. Originally a machine-translation metric, used here with custom weights that emphasize unigrams and bigrams
BERT Score (0-1): Compares contextual embeddings of the generated and reference texts, capturing semantic similarity that exact n-gram matching misses
BART Score (≤0): Scores the generated text by its log-likelihood under BART's sequence-to-sequence model; values closer to 0 indicate higher semantic similarity and generation quality
COMET Score (0-1): Crosslingual Optimized Metric for Evaluation of Translation. A regression model trained on human judgments that takes the source, reference, and candidate as inputs and predicts a quality score
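
The n-gram overlap behind ROUGE and BLEU can be illustrated with a small, dependency-free sketch. This is not the implementation used by this project or by any particular library: the function names `rouge_n` and `bleu_like`, the whitespace tokenization, and the weights `(0.6, 0.4)` are all assumptions chosen for illustration, and the BLEU variant applies no smoothing.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N style recall, precision, and F1 from clipped n-gram overlap."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # min count per shared n-gram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


def bleu_like(reference, candidate, weights=(0.6, 0.4)):
    """Weighted geometric mean of clipped n-gram precisions with brevity penalty.

    weights[k] applies to (k+1)-grams. The default (0.6, 0.4) is a
    hypothetical choice that emphasizes unigrams and bigrams.
    """
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    if not cand_tokens:
        return 0.0
    log_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        ref, cand = ngrams(ref_tokens, n), ngrams(cand_tokens, n)
        total = sum(cand.values())
        matched = sum((ref & cand).values())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += weight * math.log(matched / total)
    # Penalize candidates that are shorter than the reference.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * math.exp(log_sum)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat sat"
    print(rouge_n(reference, candidate, n=1))
    print(bleu_like(reference, candidate))
```

In this example every candidate n-gram appears in the reference, so precision is perfect, while recall and the brevity penalty reflect that the candidate covers only half of the reference.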

Advanced metrics that require an LLM provider:

Faithfulness (0-1): Measures factual consistency between summary and source text
Topic Preservation (0-1): Verifies that the most important topics from the source are retained in the summary
Redundancy Detection (0-1): Identifies and flags repeated information within summaries
Conciseness Assessment (0-1): Evaluates whether the summary condenses the source information without unnecessary verbosity
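
As a rough illustration of how an LLM-backed metric such as Faithfulness could yield a score between 0 and 1, the sketch below splits a summary into sentence-level claims and reports the fraction that a judge marks as supported by the source. Everything here is hypothetical: `split_claims`, `faithfulness`, and the `judge` callable are illustrative names, and `substring_judge` is a toy stand-in for a real LLM provider call, not how this project computes the metric.

```python
import re
from typing import Callable


def split_claims(summary: str) -> list[str]:
    """Naively treat each sentence of the summary as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [part for part in parts if part]


def faithfulness(summary: str, source: str,
                 judge: Callable[[str, str], bool]) -> float:
    """Fraction of summary claims that the judge marks as supported."""
    claims = split_claims(summary)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if judge(claim, source))
    return supported / len(claims)


def substring_judge(claim: str, source: str) -> bool:
    """Toy stand-in for an LLM call: supported if the claim appears verbatim."""
    return claim.rstrip(".!?").lower() in source.lower()


if __name__ == "__main__":
    source = "The bridge opened in 1932. It spans the harbour."
    summary = "The bridge opened in 1932. It is painted red."
    print(faithfulness(summary, source, substring_judge))
```

In practice the judge would prompt an LLM with the claim and the source and parse a yes/no verdict; keeping it as a plain callable makes the scoring logic testable without a provider.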

RAG Metrics

Metrics designed specifically for evaluating Retrieval-Augmented Generation (RAG):

Answer Attribution (0-1): Evaluates whether the answer's claims are supported by the provided context
Answer Relevance (0-1): Measures how well the answer addresses the specific query intent
Completeness (0-1): Evaluates whether the answer addresses all aspects of the query comprehensively
Context Utilisation (0-1): Assesses how well the retrieved context aligns with and is applicable to the query
Faithfulness (0-1): Measures how accurately the answer reflects the information contained in the context without introducing external or contradictory information

Choose metrics based on your evaluation needs and available resources: the traditional metrics run without an LLM provider, while the advanced metrics require one.