Factual Consistency

Factual Consistency measures the precision of a summary by verifying that the claims it makes are supported by the reference text. Unlike faithfulness, which focuses on overall semantic consistency, this metric explicitly extracts and validates individual factual claims.

Overview

This metric evaluates whether the generated summary contains only information that is supported by or can be directly inferred from the source text. It provides a precision score that helps identify potential hallucinations or unsupported claims in summaries.

How It Works

Factual Consistency operates in three steps:

  1. Claim Extraction: Uses an LLM to extract all factual claims from the summary

    • Each claim is an atomic, verifiable piece of information
    • Compound claims are split into separate individual claims
    • Only objective facts are extracted (opinions and judgments are excluded)
  2. Claim Verification: Each extracted claim is verified against the reference text

    • The LLM determines if each claim is supported by or can be inferred from the source
    • Returns a binary decision (supported/unsupported) for each claim
  3. Score Calculation: Computes the precision score

    Factual Consistency = Supported Claims / Total Summary Claims

This provides a measure of how accurate and grounded the summary is: a precision metric that penalizes hallucinations and unsupported statements.
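The three-step pipeline can be sketched as follows. Note that `extract_claims` and `is_supported` are hypothetical stand-ins for the LLM calls (here reduced to naive string operations); only the scoring step mirrors the precision formula above.

```python
def extract_claims(summary: str) -> list[str]:
    # Placeholder: in practice an LLM splits the summary into
    # atomic, verifiable claims. Here: naive sentence split.
    return [s.strip() for s in summary.split(".") if s.strip()]


def is_supported(claim: str, reference: str) -> bool:
    # Placeholder: in practice an LLM judges whether the claim is
    # supported by or inferable from the reference text.
    return claim.lower() in reference.lower()


def factual_consistency(summary: str, reference: str) -> float:
    claims = extract_claims(summary)
    if not claims:
        return 1.0  # no claims means nothing to contradict the source
    supported = sum(is_supported(c, reference) for c in claims)
    return supported / len(claims)
```

The real implementation delegates both extraction and verification to an LLM; this sketch only illustrates how the supported/total ratio is assembled.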

Usage

Here's how to evaluate factual consistency using Assert LLM Tools:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
)

# Option 2: OpenAI
llm_config = LLMConfig(
    provider="openai",
    model_id="gpt-4o-mini",
    api_key="your-api-key",
)

# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds."
summary = "The black cat sleeps by the window when it's sunny and catches mice."

# Evaluate factual consistency
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")

Verbose Mode

For detailed claim-level analysis, use the verbose parameter to see which specific claims are supported or unsupported:

# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    verbose=True,
)

# Access detailed results
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nSummary claims: {metrics['summary_claims_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")

# Detailed claim-level analysis
if 'claims_analysis' in metrics:
    print("\nDetailed Claim Analysis:")
    for i, claim_data in enumerate(metrics['claims_analysis'], 1):
        status = "✓ SUPPORTED" if claim_data['is_supported'] else "✗ UNSUPPORTED"
        print(f"{i}. {status}: {claim_data['claim']}")

Example Output

In the example above where the summary claims "catches mice" (which is not in the source), the verbose output would show:

Factual Consistency: 0.6667

Summary claims: 3
Supported claims: 2
Unsupported claims: 1

Detailed Claim Analysis:
1. ✓ SUPPORTED: The cat is black
2. ✓ SUPPORTED: The cat sleeps by the window when it's sunny
3. ✗ UNSUPPORTED: The cat catches mice

Return Values

The metric returns a dictionary containing:

  • factual_consistency: Score from 0-1 (supported_claims / total_summary_claims)
  • summary_claims_count: Total claims extracted from summary
  • supported_claims_count: Number of summary claims supported by reference
  • unsupported_claims_count: Number of summary claims not supported by reference
  • claims_analysis (only if verbose=True): List of dicts with:
    • claim: The extracted claim text
    • is_supported: Boolean indicating if the claim is supported
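
Putting these fields together, a verbose result for the worked example earlier on this page might look like the following. This is an illustration of the shape only, not an exact guarantee of the library's output:

```python
# Illustrative verbose result, using the worked example's numbers;
# field names follow the return-value list above.
example_result = {
    "factual_consistency": 2 / 3,
    "summary_claims_count": 3,
    "supported_claims_count": 2,
    "unsupported_claims_count": 1,
    "claims_analysis": [
        {"claim": "The cat is black", "is_supported": True},
        {"claim": "The cat sleeps by the window when it's sunny", "is_supported": True},
        {"claim": "The cat catches mice", "is_supported": False},
    ],
}

# The score is always supported / total, consistent with the formula
assert example_result["factual_consistency"] == (
    example_result["supported_claims_count"]
    / example_result["summary_claims_count"]
)
```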

Interpretation

The factual consistency score ranges from 0 to 1:

  • 1.0: Perfect precision - all summary claims are supported by the source
  • 0.8-0.99: Excellent - very few unsupported claims
  • 0.6-0.79: Good - some unsupported claims but mostly accurate
  • 0.4-0.59: Fair - significant number of unsupported claims
  • Below 0.4: Poor - many hallucinations or unsupported statements

Special Cases

  • Score of 1.0 with 0 claims: If the summary contains no extractable claims, the score is 1.0 (perfect consistency by default)
  • High score, low informativeness: A summary with very few claims may score high but not be useful

When to Use

Use factual consistency when:

  • You need to detect and quantify hallucinations in summaries
  • Precision is more important than completeness (prefer accuracy over coverage)
  • Evaluating abstractive summarization models that may introduce new information
  • Ensuring factual accuracy in high-stakes domains (medical, legal, financial)
  • Comparing different models for their tendency to hallucinate

How it compares to related metrics:

  • vs. Faithfulness: Faithfulness evaluates overall semantic consistency, while Factual Consistency explicitly extracts and verifies individual claims
  • vs. Coverage: Coverage measures recall (how much of the source is in the summary), while Factual Consistency measures precision (how much of the summary is supported)
  • vs. Factual Alignment: Factual Alignment combines coverage (recall) and factual consistency (precision) into an F1 score
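
The relationship to Factual Alignment can be written down with the standard F1 formula. This is a sketch of the arithmetic only; the library's exact aggregation may differ:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of factual consistency (precision) and
    # coverage (recall); defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the lower of the two values, a summary cannot compensate for poor coverage with high factual consistency, or vice versa.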

Custom Instructions

You can provide custom instructions to guide the LLM's evaluation:

custom_instruction = "Pay special attention to numerical values and dates. A claim is only supported if numbers match exactly."

metrics = evaluate_summary(
    financial_report,
    financial_summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    custom_instruction=custom_instruction,
)

Limitations

  • Requires an LLM provider, which may incur costs
  • Results may vary depending on the LLM model used for claim extraction and verification
  • Claim extraction quality depends on the LLM's ability to identify atomic facts
  • May miss nuanced forms of inconsistency that require deep reasoning
  • The metric is conservative - ambiguous claims are often marked as supported
  • Does not evaluate coverage (completeness) - a summary with few but accurate claims scores high

Best Practices

  1. Use verbose mode during development to understand which specific claims are failing
  2. Combine with coverage metrics to ensure both accuracy and completeness
  3. Use custom instructions for domain-specific evaluation criteria
  4. Test with edge cases to understand how the metric handles your specific use case
  5. Monitor claim counts - very low claim counts may indicate extraction issues

Related Metrics

  • Factual Alignment: F1 score combining coverage and factual consistency
  • Faithfulness: Semantic consistency between summary and source
  • Coverage: Measures how much of the source is included in the summary (recall)