# Coverage
Coverage measures the completeness/recall of summaries by verifying that claims from the reference text are present in the summary. This metric evaluates how comprehensively the summary captures the important information from the source.
## Overview
This metric evaluates whether the summary includes the key information from the source text. It provides a recall score that helps identify when important information is missing from summaries, ensuring completeness of coverage.
## How It Works

Coverage operates in three steps:

1. **Claim Extraction**: An LLM extracts all factual claims from the reference text.
   - Each claim is an atomic, verifiable piece of information
   - Compound claims are split into separate individual claims
   - Only objective facts are extracted (opinions and judgments are excluded)
2. **Claim Presence Check**: Each extracted claim is checked against the summary.
   - The LLM determines whether each reference claim appears in the summary (even if worded differently)
   - Returns a binary decision (present/missing) for each claim
3. **Score Calculation**: Computes the recall score:

   ```
   Coverage = Claims in Summary / Total Reference Claims
   ```

This yields a measure of how complete the summary is: a recall metric that penalizes missing important information.
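The scoring step above can be sketched in a few lines, assuming the LLM has already returned one present/missing flag per extracted claim. The `coverage_score` helper and the flags below are illustrative, not part of the library:

```python
# Minimal sketch of the score calculation, assuming the claim
# extraction and presence checks have already been done by an LLM.

def coverage_score(claim_flags):
    """Recall: fraction of reference claims found in the summary."""
    if not claim_flags:
        # No extractable claims: perfect coverage by default
        return 1.0
    return sum(claim_flags) / len(claim_flags)

# One flag per reference claim (True = present in the summary)
flags = [True, True, False, False]
print(coverage_score(flags))  # 0.5
```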
## Usage
Here's how to evaluate coverage using Assert LLM Tools:
```python
from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure an LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
)

# Option 2: OpenAI
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key",
# )

# Example texts
full_text = (
    "The cat is black and sleeps on the windowsill during sunny afternoons. "
    "It enjoys watching birds and occasionally naps in the garden."
)
summary = "The black cat sleeps by the window."

# Evaluate coverage
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["coverage"],
    llm_config=llm_config,
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")
```
### Verbose Mode
For detailed claim-level analysis, use the verbose parameter to see which specific claims are covered or missing:
```python
# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["coverage"],
    llm_config=llm_config,
    verbose=True,
)

# Access detailed results
print(f"Coverage: {metrics['coverage']:.4f}")
print(f"\nReference claims: {metrics['reference_claims_count']}")
print(f"Claims in summary: {metrics['claims_in_summary_count']}")
print(f"Missing claims: {metrics['reference_claims_count'] - metrics['claims_in_summary_count']}")

# Detailed claim-level analysis
if 'claims_analysis' in metrics:
    print("\nDetailed Claim Analysis:")
    for i, claim_data in enumerate(metrics['claims_analysis'], 1):
        status = "✓ COVERED" if claim_data['is_covered'] else "✗ MISSING"
        print(f"{i}. {status}: {claim_data['claim']}")
```
### Example Output
In the example above where the summary omits information about watching birds and napping in the garden, the verbose output would show:
```
Coverage: 0.5000

Reference claims: 4
Claims in summary: 2
Missing claims: 2

Detailed Claim Analysis:
1. ✓ COVERED: The cat is black
2. ✓ COVERED: The cat sleeps on the windowsill during sunny afternoons
3. ✗ MISSING: The cat enjoys watching birds
4. ✗ MISSING: The cat occasionally naps in the garden
```
## Return Values
The metric returns a dictionary containing:
- `coverage`: Score from 0-1 (`claims_in_summary / total_reference_claims`)
- `reference_claims_count`: Total claims extracted from the reference
- `claims_in_summary_count`: Number of reference claims present in the summary
- `claims_analysis` (only if `verbose=True`): List of dicts with:
  - `claim`: The extracted claim text from the reference
  - `is_covered`: Boolean indicating whether the claim appears in the summary
## Interpretation
The coverage score ranges from 0 to 1:
- 1.0: Perfect recall - all reference claims are present in the summary
- 0.8-0.99: Excellent - most claims covered, minor omissions
- 0.6-0.79: Good - reasonable coverage but some important information missing
- 0.4-0.59: Fair - significant information gaps
- Below 0.4: Poor - many important claims missing from summary
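For reporting purposes, these bands can be captured in a small helper. The thresholds mirror the list above, but the `coverage_band` function itself is hypothetical, not a library API:

```python
def coverage_band(score):
    """Map a 0-1 coverage score to a qualitative band (hypothetical helper)."""
    if score >= 1.0:
        return "perfect"
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "fair"
    return "poor"

print(coverage_band(0.5))  # fair
```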
### Special Cases
- Score of 1.0 with 0 claims: If the reference contains no extractable claims, the score is 1.0 (perfect coverage by default)
- High score with very short summary: May indicate the reference had few key claims to begin with
## When to Use
Use coverage when:
- You need to ensure comprehensive inclusion of source information
- Recall is more important than precision (prefer completeness over brevity)
- Evaluating whether summaries miss critical information
- Assessing information retention in summarization systems
- Comparing models for their tendency to omit important details
- Working in domains where completeness is critical (medical, legal, technical documentation)
## Comparison with Related Metrics
- vs. Factual Consistency: Coverage measures recall (source claims in summary), while Factual Consistency measures precision (summary claims supported by source)
- vs. Faithfulness: Faithfulness evaluates overall semantic consistency, while Coverage explicitly checks for presence of individual source claims
- vs. Factual Alignment: Factual Alignment combines both coverage (recall) and factual consistency (precision) into an F1 score
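The F1 combination mentioned for Factual Alignment is the standard harmonic mean of precision and recall. A generic sketch (not the library's implementation):

```python
def f1(precision, recall):
    """Harmonic mean of precision (factual consistency) and recall (coverage)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A highly precise but incomplete summary is pulled down by low coverage
print(round(f1(0.9, 0.5), 4))  # 0.6429
```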
## Custom Instructions
You can provide custom instructions to guide the LLM's evaluation:
```python
custom_instruction = (
    "Focus on technical specifications and numerical data. "
    "A claim is only covered if specific numbers are preserved."
)

metrics = evaluate_summary(
    technical_doc,
    technical_summary,
    metrics=["coverage"],
    llm_config=llm_config,
    custom_instruction=custom_instruction,
)
```
## Limitations
- Requires an LLM provider, which may incur costs
- Results may vary depending on the LLM model used for claim extraction and verification
- Claim extraction quality depends on the LLM's ability to identify key facts
- May miss nuanced forms of information omission that require deep reasoning
- Paraphrased claims should still be marked as present, but this depends on the LLM's interpretation
- Does not evaluate accuracy (precision): a summary containing every reference claim plus hallucinations still scores high
## Best Practices
- Use verbose mode during development to understand which specific claims are being missed
- Combine with factual consistency to ensure both completeness and accuracy
- Use custom instructions for domain-specific evaluation criteria
- Monitor claim counts - very low claim counts from the reference may indicate extraction issues
- Consider the source length - longer sources may have many claims, making high coverage difficult
## Understanding Coverage vs. Length Trade-offs
Coverage and conciseness often work in opposition:
- High coverage, long summary: Comprehensive but potentially verbose
- Low coverage, short summary: Concise but potentially missing key information
- Ideal balance: Use Factual Alignment to optimize both coverage and accuracy
## Related Metrics
- Factual Alignment: F1 score combining coverage (recall) and factual consistency (precision)
- Factual Consistency: Measures precision (summary claims supported by source)
- Conciseness: Evaluates brevity and efficiency of summaries (often inversely related to coverage)