# Coverage
Coverage measures the completeness/recall of summaries by verifying that claims from the reference text are present in the summary. This metric evaluates how comprehensively the summary captures the important information from the source.
## Overview
This metric evaluates whether the summary includes the key information from the source text. It provides a recall score that helps identify when important information is missing from summaries, ensuring completeness of coverage.
## How It Works

Coverage operates in three steps:

1. **Claim Extraction**: An LLM extracts all factual claims from the reference text.
   - Each claim is an atomic, verifiable piece of information
   - Compound claims are split into separate individual claims
   - Only objective facts are extracted (opinions and judgments are excluded)
2. **Claim Presence Check**: Each extracted claim is checked against the summary.
   - The LLM determines whether each reference claim appears in the summary (even if worded differently)
   - Returns a binary decision (present/missing) for each claim
3. **Score Calculation**: Computes the recall score:

   ```
   Coverage = Claims in Summary / Total Reference Claims
   ```

This yields a measure of how complete the summary is: a recall metric that penalizes missing important information.
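The scoring step above can be sketched in a few lines, assuming the LLM has already returned one present/missing flag per extracted claim. The `coverage_score` helper and the flags below are illustrative, not part of the library:

```python
# Minimal sketch of the score calculation, assuming the claim
# extraction and presence checks have already been done by an LLM.

def coverage_score(claim_flags):
    """Recall: fraction of reference claims found in the summary."""
    if not claim_flags:
        # No extractable claims: perfect coverage by default
        return 1.0
    return sum(claim_flags) / len(claim_flags)

# One flag per reference claim (True = present in the summary)
flags = [True, True, False, False]
print(coverage_score(flags))  # 0.5
```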
## Usage
Here's how to evaluate coverage using Assert LLM Tools:
```python
from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure an LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
)

# Option 2: OpenAI
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key",
# )

# Example texts
full_text = (
    "The cat is black and sleeps on the windowsill during sunny afternoons. "
    "It enjoys watching birds and occasionally naps in the garden."
)
summary = "The black cat sleeps by the window."

# Evaluate coverage
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["coverage"],
    llm_config=llm_config,
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")
```
### Verbose Mode
For detailed claim-level analysis, use the verbose parameter to see which specific claims are covered or missing:
```python
# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["coverage"],
    llm_config=llm_config,
    verbose=True,
)

# Access detailed results
print(f"Coverage: {metrics['coverage']:.4f}")
print(f"\nReference claims: {metrics['reference_claims_count']}")
print(f"Claims in summary: {metrics['claims_in_summary_count']}")
print(f"Missing claims: {metrics['reference_claims_count'] - metrics['claims_in_summary_count']}")

# Detailed claim-level analysis
if 'claims_analysis' in metrics:
    print("\nDetailed Claim Analysis:")
    for i, claim_data in enumerate(metrics['claims_analysis'], 1):
        status = "✓ COVERED" if claim_data['is_covered'] else "✗ MISSING"
        print(f"{i}. {status}: {claim_data['claim']}")
```
### Example Output
In the example above where the summary omits information about watching birds and napping in the garden, the verbose output would show:
```
Coverage: 0.5000

Reference claims: 4
Claims in summary: 2
Missing claims: 2

Detailed Claim Analysis:
1. ✓ COVERED: The cat is black
2. ✓ COVERED: The cat sleeps on the windowsill during sunny afternoons
3. ✗ MISSING: The cat enjoys watching birds
4. ✗ MISSING: The cat occasionally naps in the garden
```
## Return Values
The metric returns a dictionary containing:
- `coverage`: Score from 0-1 (`claims_in_summary / total_reference_claims`)
- `reference_claims_count`: Total claims extracted from the reference
- `claims_in_summary_count`: Number of reference claims present in the summary
- `claims_analysis` (only if `verbose=True`): List of dicts with:
  - `claim`: The extracted claim text from the reference
  - `is_covered`: Boolean indicating whether the claim appears in the summary
## Interpretation
The coverage score ranges from 0 to 1:
- 1.0: Perfect recall - all reference claims are present in the summary
- 0.8-0.99: Excellent - most claims covered, minor omissions
- 0.6-0.79: Good - reasonable coverage but some important information missing
- 0.4-0.59: Fair - significant information gaps
- Below 0.4: Poor - many important claims missing from summary
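For reporting purposes, these bands can be captured in a small helper. The thresholds mirror the list above, but the `coverage_band` function itself is hypothetical, not a library API:

```python
def coverage_band(score):
    """Map a 0-1 coverage score to a qualitative band (hypothetical helper)."""
    if score >= 1.0:
        return "perfect"
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "fair"
    return "poor"

print(coverage_band(0.5))  # fair
```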
### Special Cases
- Score of 1.0 with 0 claims: If the reference contains no extractable claims, the score is 1.0 (perfect coverage by default)
- High score with very short summary: May indicate the reference had few key claims to begin with
## When to Use
Use coverage when:
- You need to ensure comprehensive inclusion of source information
- Recall is more important than precision (prefer completeness over brevity)
- Evaluating whether summaries miss critical information
- Assessing information retention in summarization systems
- Comparing models for their tendency to omit important details
- Working in domains where completeness is critical (medical, legal, technical documentation)
## Comparison with Related Metrics
- vs. Factual Consistency: Coverage measures recall (source claims in summary), while Factual Consistency measures precision (summary claims supported by source)
- vs. Faithfulness: Faithfulness evaluates overall semantic consistency, while Coverage explicitly checks for presence of individual source claims
- vs. Factual Alignment: Factual Alignment combines both coverage (recall) and factual consistency (precision) into an F1 score
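The F1 combination mentioned for Factual Alignment is the standard harmonic mean of precision and recall. A generic sketch (not the library's implementation):

```python
def f1(precision, recall):
    """Harmonic mean of precision (factual consistency) and recall (coverage)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A highly precise but incomplete summary is pulled down by low coverage
print(round(f1(0.9, 0.5), 4))  # 0.6429
```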
## Custom Instructions
You can provide custom instructions to guide the LLM's evaluation:
```python
custom_instruction = (
    "Focus on technical specifications and numerical data. "
    "A claim is only covered if specific numbers are preserved."
)

metrics = evaluate_summary(
    technical_doc,
    technical_summary,
    metrics=["coverage"],
    llm_config=llm_config,
    custom_instruction=custom_instruction,
)
```
## Limitations
- Requires an LLM provider, which may incur costs
- Results may vary depending on the LLM model used for claim extraction and verification
- Claim extraction quality depends on the LLM's ability to identify key facts
- May miss nuanced forms of information omission that require deep reasoning
- Paraphrased claims should still be marked as present, but this depends on the LLM's interpretation
- Does not evaluate accuracy (precision): a summary containing every reference claim plus hallucinations still scores high
## Best Practices
- Use verbose mode during development to understand which specific claims are being missed
- Combine with factual consistency to ensure both completeness and accuracy
- Use custom instructions for domain-specific evaluation criteria
- Monitor claim counts - very low claim counts from the reference may indicate extraction issues
- Consider the source length - longer sources may have many claims, making high coverage difficult
## Understanding Coverage vs. Length Trade-offs
Coverage and conciseness often work in opposition:
- High coverage, long summary: Comprehensive but potentially verbose
- Low coverage, short summary: Concise but potentially missing key information
- Ideal balance: Use Factual Alignment to optimize both coverage and accuracy
## Related Metrics
- Factual Alignment: F1 score combining coverage (recall) and factual consistency (precision)
- Factual Consistency: Measures precision (summary claims supported by source)
- Conciseness: Evaluates brevity and efficiency of summaries (often inversely related to coverage)