# Factual Consistency
Factual Consistency measures the precision of a summary by verifying that each claim in the summary is supported by the reference text. Unlike faithfulness, which focuses on overall semantic consistency, this metric explicitly extracts and validates individual factual claims.
## Overview
This metric evaluates whether the generated summary contains only information that is supported by or can be directly inferred from the source text. It provides a precision score that helps identify potential hallucinations or unsupported claims in summaries.
## How It Works
Factual Consistency operates in three steps:

1. **Claim Extraction**: An LLM extracts every factual claim from the summary.
   - Each claim is an atomic, verifiable piece of information.
   - Compound claims are split into separate individual claims.
   - Only objective facts are extracted; opinions and judgments are excluded.
2. **Claim Verification**: Each extracted claim is verified against the reference text.
   - The LLM determines whether the claim is supported by, or can be directly inferred from, the source.
   - It returns a binary decision (supported/unsupported) for each claim.
3. **Score Calculation**: The precision score is computed as:

   Factual Consistency = Supported Claims / Total Summary Claims
This yields a measure of how accurate and grounded the summary is: a precision metric that penalizes hallucinations and unsupported statements.
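The scoring step above can be sketched in plain Python. The function name and the boolean-verdict representation are illustrative, not the library's internals:

```python
def factual_consistency_score(verdicts):
    """Precision over extracted claims: supported / total.

    `verdicts` is a list of booleans, one per extracted claim,
    True if the verifier judged the claim supported by the source.
    """
    if not verdicts:      # no extractable claims
        return 1.0        # treated as perfectly consistent by convention
    return sum(verdicts) / len(verdicts)

# Example: 2 of 3 claims supported
verdicts = [True, True, False]
print(round(factual_consistency_score(verdicts), 4))  # 0.6667
```

This also makes the zero-claims convention explicit: an empty claim list scores 1.0, since there is nothing that could be unsupported.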
## Usage
Here's how to evaluate factual consistency using Assert LLM Tools:
```python
from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure an LLM provider (choose one)
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1"
)

# Or, alternatively, use OpenAI:
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key"
# )
```
```python
# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds."
summary = "The black cat sleeps by the window when it's sunny and catches mice."

# Evaluate factual consistency
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")
```
## Verbose Mode
For detailed claim-level analysis, use the verbose parameter to see which specific claims are supported or unsupported:
```python
# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    verbose=True
)

# Access detailed results
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nSummary claims: {metrics['summary_claims_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")

# Detailed claim-level analysis
if 'claims_analysis' in metrics:
    print("\nDetailed Claim Analysis:")
    for i, claim_data in enumerate(metrics['claims_analysis'], 1):
        status = "✓ SUPPORTED" if claim_data['is_supported'] else "✗ UNSUPPORTED"
        print(f"{i}. {status}: {claim_data['claim']}")
```
## Example Output
In the example above, where the summary claims "catches mice" (which does not appear in the source), the verbose output would show:
```
Factual Consistency: 0.6667

Summary claims: 3
Supported claims: 2
Unsupported claims: 1

Detailed Claim Analysis:
1. ✓ SUPPORTED: The cat is black
2. ✓ SUPPORTED: The cat sleeps by the window when it's sunny
3. ✗ UNSUPPORTED: The cat catches mice
```
## Return Values
The metric returns a dictionary containing:
- `factual_consistency`: Score from 0-1 (supported_claims / total_summary_claims)
- `summary_claims_count`: Total claims extracted from the summary
- `supported_claims_count`: Number of summary claims supported by the reference
- `unsupported_claims_count`: Number of summary claims not supported by the reference
- `claims_analysis` (only if `verbose=True`): List of dicts with:
  - `claim`: The extracted claim text
  - `is_supported`: Boolean indicating whether the claim is supported
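For the cat example above, the returned dictionary (with `verbose=True`) would look roughly like this. The key names follow the list in this section, but this is a hand-written sketch, not captured output, and the exact shape may vary by version:

```python
# Illustrative shape of the result dictionary for the cat example.
metrics = {
    "factual_consistency": 0.6667,
    "summary_claims_count": 3,
    "supported_claims_count": 2,
    "unsupported_claims_count": 1,
    "claims_analysis": [
        {"claim": "The cat is black", "is_supported": True},
        {"claim": "The cat sleeps by the window when it's sunny", "is_supported": True},
        {"claim": "The cat catches mice", "is_supported": False},
    ],
}

# Sanity checks that always hold for this metric's counts
assert metrics["supported_claims_count"] + metrics["unsupported_claims_count"] \
    == metrics["summary_claims_count"]
```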
## Interpretation
The factual consistency score ranges from 0 to 1:
- 1.0: Perfect precision - all summary claims are supported by the source
- 0.8-0.99: Excellent - very few unsupported claims
- 0.6-0.79: Good - some unsupported claims but mostly accurate
- 0.4-0.59: Fair - significant number of unsupported claims
- Below 0.4: Poor - many hallucinations or unsupported statements
## Special Cases
- Score of 1.0 with 0 claims: If the summary contains no extractable claims, the score is 1.0 (perfect consistency by default)
- High score, low informativeness: A summary with very few claims may score high but not be useful
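One way to catch the second case is a simple guard on the claim count. The function and thresholds below are arbitrary examples for illustration, not library defaults:

```python
def looks_uninformative(score, claim_count, min_claims=3):
    """Flag summaries that score well but assert almost nothing."""
    return score >= 0.8 and claim_count < min_claims

print(looks_uninformative(1.0, 1))   # True: perfect score, but only one claim
print(looks_uninformative(0.9, 5))   # False: enough claims to trust the score
```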
## When to Use
Use factual consistency when:
- You need to detect and quantify hallucinations in summaries
- Precision is more important than completeness (prefer accuracy over coverage)
- Evaluating abstractive summarization models that may introduce new information
- Ensuring factual accuracy in high-stakes domains (medical, legal, financial)
- Comparing different models for their tendency to hallucinate
## Comparison with Related Metrics
- vs. Faithfulness: Faithfulness evaluates overall semantic consistency, while Factual Consistency explicitly extracts and verifies individual claims
- vs. Coverage: Coverage measures recall (how much of source is in summary), while Factual Consistency measures precision (how much of summary is supported)
- vs. Factual Alignment: Factual Alignment combines both coverage (recall) and factual consistency (precision) into an F1 score
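As a sketch of the last point, an F1-style combination of the two scores looks like this; the library's actual Factual Alignment metric may differ in detail:

```python
def f1(precision, recall):
    """Harmonic mean of precision (factual consistency) and recall (coverage)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# E.g. factual consistency 0.6667 with coverage 0.5
print(round(f1(2 / 3, 0.5), 4))  # 0.5714
```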
## Custom Instructions
You can provide custom instructions to guide the LLM's evaluation:
```python
custom_instruction = "Pay special attention to numerical values and dates. A claim is only supported if numbers match exactly."

metrics = evaluate_summary(
    financial_report,
    financial_summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    custom_instruction=custom_instruction
)
```
## Limitations
- Requires an LLM provider, which may incur costs
- Results may vary depending on the LLM model used for claim extraction and verification
- Claim extraction quality depends on the LLM's ability to identify atomic facts
- May miss nuanced forms of inconsistency that require deep reasoning
- The verification step can be lenient: ambiguous claims are often marked as supported
- Does not evaluate coverage (completeness) - a summary with few but accurate claims scores high
## Best Practices
- Use verbose mode during development to understand which specific claims are failing
- Combine with coverage metrics to ensure both accuracy and completeness
- Use custom instructions for domain-specific evaluation criteria
- Test with edge cases to understand how the metric handles your specific use case
- Monitor claim counts - very low claim counts may indicate extraction issues
## Related Metrics
- Factual Alignment: F1 score combining coverage and factual consistency
- Faithfulness: Semantic consistency between summary and source
- Coverage: Measures how much of the source is included in the summary (recall)