Factual Consistency

Factual Consistency measures the precision of a summary by verifying that the claims it makes are supported by the reference text. Unlike faithfulness, which focuses on overall semantic consistency, this metric explicitly extracts and validates individual factual claims.

Overview

This metric evaluates whether the generated summary contains only information that is supported by or can be directly inferred from the source text. It provides a precision score that helps identify potential hallucinations or unsupported claims in summaries.

How It Works

Factual Consistency operates in three steps:

  1. Claim Extraction: Uses an LLM to extract all factual claims from the summary

    • Each claim is an atomic, verifiable piece of information
    • Compound claims are split into separate individual claims
    • Only objective facts are extracted (opinions and judgments are excluded)
  2. Claim Verification: Each extracted claim is verified against the reference text

    • The LLM determines if each claim is supported by or can be inferred from the source
    • Returns a binary decision (supported/unsupported) for each claim
  3. Score Calculation: Computes the precision score

    Factual Consistency = Supported Claims / Total Summary Claims

This provides a measure of how accurate and grounded the summary is: a precision metric that penalizes hallucinations and unsupported statements.
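The three-step pipeline can be sketched as follows. Note that `extract_claims` and `is_supported` are hypothetical stand-ins for the LLM calls (here reduced to naive string operations); only the scoring step mirrors the precision formula above.

```python
def extract_claims(summary: str) -> list[str]:
    # Placeholder: in practice an LLM splits the summary into
    # atomic, verifiable claims. Here: naive sentence split.
    return [s.strip() for s in summary.split(".") if s.strip()]


def is_supported(claim: str, reference: str) -> bool:
    # Placeholder: in practice an LLM judges whether the claim is
    # supported by or inferable from the reference text.
    return claim.lower() in reference.lower()


def factual_consistency(summary: str, reference: str) -> float:
    claims = extract_claims(summary)
    if not claims:
        return 1.0  # no claims means nothing to contradict the source
    supported = sum(is_supported(c, reference) for c in claims)
    return supported / len(claims)
```

The real implementation delegates both extraction and verification to an LLM; this sketch only illustrates how the supported/total ratio is assembled.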

Usage

Here's how to evaluate factual consistency using Assert LLM Tools:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
)

# Option 2: OpenAI
llm_config = LLMConfig(
    provider="openai",
    model_id="gpt-4o-mini",
    api_key="your-api-key",
)

# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds."
summary = "The black cat sleeps by the window when it's sunny and catches mice."

# Evaluate factual consistency
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")

Verbose Mode

For detailed claim-level analysis, use the verbose parameter to see which specific claims are supported or unsupported:

# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    verbose=True,
)

# Access detailed results
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nSummary claims: {metrics['summary_claims_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")

# Detailed claim-level analysis
if 'claims_analysis' in metrics:
    print("\nDetailed Claim Analysis:")
    for i, claim_data in enumerate(metrics['claims_analysis'], 1):
        status = "✓ SUPPORTED" if claim_data['is_supported'] else "✗ UNSUPPORTED"
        print(f"{i}. {status}: {claim_data['claim']}")

Example Output

In the example above where the summary claims "catches mice" (which is not in the source), the verbose output would show:

Factual Consistency: 0.6667

Summary claims: 3
Supported claims: 2
Unsupported claims: 1

Detailed Claim Analysis:
1. ✓ SUPPORTED: The cat is black
2. ✓ SUPPORTED: The cat sleeps by the window when it's sunny
3. ✗ UNSUPPORTED: The cat catches mice

Return Values

The metric returns a dictionary containing:

  • factual_consistency: Score from 0-1 (supported_claims / total_summary_claims)
  • summary_claims_count: Total claims extracted from summary
  • supported_claims_count: Number of summary claims supported by reference
  • unsupported_claims_count: Number of summary claims not supported by reference
  • claims_analysis (only if verbose=True): List of dicts with:
    • claim: The extracted claim text
    • is_supported: Boolean indicating if the claim is supported
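
Putting these fields together, a verbose result for the worked example earlier on this page might look like the following. This is an illustration of the shape only, not an exact guarantee of the library's output:

```python
# Illustrative verbose result, using the worked example's numbers;
# field names follow the return-value list above.
example_result = {
    "factual_consistency": 2 / 3,
    "summary_claims_count": 3,
    "supported_claims_count": 2,
    "unsupported_claims_count": 1,
    "claims_analysis": [
        {"claim": "The cat is black", "is_supported": True},
        {"claim": "The cat sleeps by the window when it's sunny", "is_supported": True},
        {"claim": "The cat catches mice", "is_supported": False},
    ],
}

# The score is always supported / total, consistent with the formula
assert example_result["factual_consistency"] == (
    example_result["supported_claims_count"]
    / example_result["summary_claims_count"]
)
```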

Interpretation

The factual consistency score ranges from 0 to 1:

  • 1.0: Perfect precision - all summary claims are supported by the source
  • 0.8-0.99: Excellent - very few unsupported claims
  • 0.6-0.79: Good - some unsupported claims but mostly accurate
  • 0.4-0.59: Fair - significant number of unsupported claims
  • Below 0.4: Poor - many hallucinations or unsupported statements

Special Cases

  • Score of 1.0 with 0 claims: If the summary contains no extractable claims, the score is 1.0 (perfect consistency by default)
  • High score, low informativeness: A summary with very few claims may score high but not be useful

When to Use

Use factual consistency when:

  • You need to detect and quantify hallucinations in summaries
  • Precision is more important than completeness (prefer accuracy over coverage)
  • Evaluating abstractive summarization models that may introduce new information
  • Ensuring factual accuracy in high-stakes domains (medical, legal, financial)
  • Comparing different models for their tendency to hallucinate

How it compares to related metrics:

  • vs. Faithfulness: Faithfulness evaluates overall semantic consistency, while Factual Consistency explicitly extracts and verifies individual claims
  • vs. Coverage: Coverage measures recall (how much of the source is in the summary), while Factual Consistency measures precision (how much of the summary is supported)
  • vs. Factual Alignment: Factual Alignment combines coverage (recall) and factual consistency (precision) into an F1 score
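
The relationship to Factual Alignment can be written down with the standard F1 formula. This is a sketch of the arithmetic only; the library's exact aggregation may differ:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of factual consistency (precision) and
    # coverage (recall); defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the lower of the two values, a summary cannot compensate for poor coverage with high factual consistency, or vice versa.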

Custom Instructions

You can provide custom instructions to guide the LLM's evaluation:

custom_instruction = "Pay special attention to numerical values and dates. A claim is only supported if numbers match exactly."

metrics = evaluate_summary(
    financial_report,
    financial_summary,
    metrics=["factual_consistency"],
    llm_config=llm_config,
    custom_instruction=custom_instruction,
)

Limitations

  • Requires an LLM provider, which may incur costs
  • Results may vary depending on the LLM model used for claim extraction and verification
  • Claim extraction quality depends on the LLM's ability to identify atomic facts
  • May miss nuanced forms of inconsistency that require deep reasoning
  • The metric is conservative - ambiguous claims are often marked as supported
  • Does not evaluate coverage (completeness) - a summary with few but accurate claims scores high

Best Practices

  1. Use verbose mode during development to understand which specific claims are failing
  2. Combine with coverage metrics to ensure both accuracy and completeness
  3. Use custom instructions for domain-specific evaluation criteria
  4. Test with edge cases to understand how the metric handles your specific use case
  5. Monitor claim counts - very low claim counts may indicate extraction issues

Related Metrics

  • Factual Alignment: F1 score combining coverage and factual consistency
  • Faithfulness: Semantic consistency between summary and source
  • Coverage: Measures how much of the source is included in the summary (recall)