Metrics
This section describes the metrics used to evaluate text quality and similarity. They fall into two main categories: summarization metrics and RAG metrics.
Summarization Metrics
Non-LLM Metrics
Traditional metrics that run without an LLM provider:
- ROUGE Score (0-1): Measures overlap of n-grams between the reference text and generated summary
- BLEU Score (0-1): Measures n-gram precision of the generated text against the reference, with custom weights that emphasize unigrams and bigrams. Originally designed for evaluating translation quality
- BERT Score (0-1): Uses contextual embeddings to capture semantic similarity that exact n-gram matching misses
- BART Score (≤0): Scores the generated text by its log-likelihood under BART's sequence-to-sequence model, so values are at most 0 and values closer to 0 indicate better quality
- COMET Score (0-1): Crosslingual Optimized Metric for Evaluation of Translation, a regression model trained on human judgments that takes the source, reference, and candidate as inputs and predicts a quality score
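To make the n-gram overlap behind ROUGE and BLEU concrete, here is a minimal pure-Python sketch. It is an illustration only, not the implementation used by this project: tokenization is plain whitespace splitting, a single reference is assumed, and the function names and the BLEU weights `(0.6, 0.4)` are hypothetical stand-ins for the "custom weights emphasizing unigrams and bigrams" mentioned above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall, and F1 using whitespace tokens."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # matches, clipped to reference counts
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def bleu(reference, candidate, weights=(0.6, 0.4)):
    """Simplified single-reference BLEU; weights[0] is for unigrams, weights[1] for bigrams.

    The default weights are an assumption chosen for illustration.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    log_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue
        cand = ngrams(cand_tokens, n)
        total = sum(cand.values())
        matches = sum((ngrams(ref_tokens, n) & cand).values())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += weight * math.log(matches / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand_tokens) > len(ref_tokens):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return penalty * math.exp(log_sum)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(bleu(reference, candidate))
```

Both scores stay in the 0-1 range: an identical candidate scores 1, and a candidate sharing no n-grams with the reference scores 0. Production implementations typically add stemming, smoothing, and multi-reference support, which this sketch omits.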
LLM-Based Metrics
Advanced metrics that require an LLM provider:
- Faithfulness (0-1): Measures factual consistency between summary and source text
- Topic Preservation (0-1): Verifies that the most important topics from the source are retained in the summary
- Redundancy Detection (0-1): Identifies and flags repeated information within summaries
- Conciseness Assessment (0-1): Evaluates whether the summary condenses the source effectively, without unnecessary verbosity
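One common way to turn an LLM judgment into a 0-1 score such as Faithfulness is to split the summary into claims and report the fraction that the source supports. The sketch below illustrates that idea under stated assumptions; it is not necessarily how this project computes the metric. The names `split_claims`, `faithfulness`, and `keyword_judge` are hypothetical, each sentence is naively treated as one claim, and `keyword_judge` is a trivial keyword check standing in for a real LLM call so that the example runs offline.

```python
import re
from typing import Callable, List


def split_claims(text: str) -> List[str]:
    """Naively treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def faithfulness(summary: str, source: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of summary claims that the judge marks as supported by the source."""
    claims = split_claims(summary)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, source))
    return supported / len(claims)


def keyword_judge(claim: str, source: str) -> bool:
    """Stand-in for an LLM judge (assumption): every word of the claim that is
    longer than three characters must appear in the source."""
    words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    return bool(words) and words <= source_words


if __name__ == "__main__":
    source = "The bridge opened in 1932 and spans the harbour."
    summary = "The bridge opened in 1932. It was painted red."
    print(faithfulness(summary, source, keyword_judge))
```

In practice, `is_supported` would wrap a prompt to the configured LLM provider asking whether the source entails the claim. Keeping the judge as a plain callable makes it easy to swap providers or to stub it out in tests.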
RAG Metrics
Metrics specifically designed for evaluating Retrieval-Augmented Generation:
- Answer Attribution (0-1): Evaluates whether the answer's claims are supported by the provided context
- Answer Relevance (0-1): Measures how well the answer addresses the specific query intent
- Completeness (0-1): Evaluates whether the answer addresses all aspects of the query comprehensively
- Context Utilisation (0-1): Assesses how relevant and applicable the retrieved context is to the query
- Faithfulness (0-1): Measures how accurately the answer reflects the information contained in the context without introducing external or contradictory information
Choose the appropriate metrics based on your evaluation needs and available resources.