Factual Alignment

Factual Alignment provides a balanced measure of summary quality by calculating the F1 score that combines coverage (recall) and factual consistency (precision).

Overview

This metric addresses a key challenge in summarization evaluation: balancing completeness and accuracy. A good summary should both capture the important information from the source (coverage) and ensure that everything it states is factually supported (consistency). The F1 score provides a single metric that achieves this balance.

How It Works

Factual Alignment combines two complementary metrics:

  1. Coverage (Recall): the percentage of source claims that appear in the summary
  2. Factual Consistency (Precision): the percentage of summary claims that are supported by the source

The F1 score is the harmonic mean of these two metrics:

F1 = 2 × (precision × recall) / (precision + recall)

This is particularly useful when you want to ensure summaries are both comprehensive and factually grounded, preventing both information loss and hallucinations.
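As a quick sanity check, the formula can be computed directly in plain Python, independent of the library:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# A summary with 90% of its claims supported (precision) that covers
# 80% of the source claims (recall):
print(f"{f1_score(precision=0.9, recall=0.8):.4f}")  # 0.8471
```

Because the harmonic mean is dominated by the smaller component, a summary cannot score well by excelling at only one of coverage or consistency.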

Usage

Here's how to evaluate factual alignment using Assert LLM Tools:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure an LLM provider (choose one)
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1"
)

# Alternatively, use OpenAI:
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key"
# )

# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds and occasionally naps in the garden."
summary = "The black cat sleeps by the window when it's sunny."

# Evaluate factual alignment
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")

Verbose Mode

For detailed claim-level analysis, use the verbose parameter:

# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    verbose=True
)

# Access detailed results
print(f"Factual Alignment: {metrics['factual_alignment']:.4f}")
print(f"Coverage: {metrics['coverage']:.4f}")
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nReference claims: {metrics['reference_claims_count']}")
print(f"Summary claims: {metrics['summary_claims_count']}")
print(f"Claims in summary: {metrics['claims_in_summary_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")

# Detailed claim-level analysis
if 'coverage_claims_analysis' in metrics:
    print("\nCoverage Analysis:")
    print(metrics['coverage_claims_analysis'])

if 'consistency_claims_analysis' in metrics:
    print("\nConsistency Analysis:")
    print(metrics['consistency_claims_analysis'])

Return Values

The metric returns a dictionary containing:

  • factual_alignment: F1 score combining coverage and factual_consistency (0-1)
  • coverage: Recall score (how much of source is in summary)
  • factual_consistency: Precision score (how much of summary is supported)
  • reference_claims_count: Total claims in reference
  • summary_claims_count: Total claims in summary
  • claims_in_summary_count: Source claims found in summary
  • supported_claims_count: Summary claims supported by source
  • unsupported_claims_count: Summary claims not supported by source
  • coverage_claims_analysis (only if verbose=True): Detailed coverage claim analysis
  • consistency_claims_analysis (only if verbose=True): Detailed consistency claim analysis
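Assuming the component scores are derived from these counts in the straightforward way (an illustration of the relationship, not necessarily the library's exact internals), the values fit together like this:

```python
# Hypothetical counts, shaped like the returned dictionary
metrics = {
    "reference_claims_count": 10,
    "summary_claims_count": 6,
    "claims_in_summary_count": 7,
    "supported_claims_count": 5,
}

coverage = metrics["claims_in_summary_count"] / metrics["reference_claims_count"]    # recall
consistency = metrics["supported_claims_count"] / metrics["summary_claims_count"]    # precision
alignment = 2 * (consistency * coverage) / (consistency + coverage)                  # F1

print(f"coverage={coverage:.2f}, consistency={consistency:.2f}, alignment={alignment:.4f}")
```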

Interpretation

The factual alignment score ranges from 0 to 1:

  • 1.0: Perfect balance - the summary comprehensively covers source claims while being fully supported
  • 0.8-0.99: Excellent - minor gaps in coverage or support
  • 0.6-0.79: Good - reasonable balance but room for improvement
  • 0.4-0.59: Fair - significant issues with either coverage or accuracy
  • Below 0.4: Poor - major problems with completeness or factual grounding
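The bands above can be encoded as a small reporting helper; the function itself is illustrative (not part of Assert LLM Tools), with thresholds taken directly from the list:

```python
def interpret_alignment(score: float) -> str:
    """Map a factual alignment score to the qualitative bands above."""
    if score >= 1.0:
        return "Perfect balance"
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Fair"
    return "Poor"

print(interpret_alignment(0.8471))  # Excellent
```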

Understanding the Components

  • High Coverage, Low Consistency: Summary includes most source information but adds unsupported claims (hallucinations)
  • High Consistency, Low Coverage: Summary is accurate but misses important information (incomplete)
  • High Factual Alignment: Both metrics are high, indicating a well-balanced summary
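These three cases can be told apart by inspecting the component scores. The helper below, including its 0.7 threshold, is purely illustrative and not part of Assert LLM Tools:

```python
def diagnose(coverage: float, consistency: float, threshold: float = 0.7) -> str:
    """Rough diagnosis of a summary from its component scores."""
    if coverage >= threshold and consistency < threshold:
        return "likely hallucination: unsupported claims were added"
    if consistency >= threshold and coverage < threshold:
        return "likely incomplete: important source information is missing"
    if coverage >= threshold and consistency >= threshold:
        return "well-balanced summary"
    return "both coverage and consistency need improvement"

print(diagnose(coverage=0.9, consistency=0.5))
```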

When to Use

Use factual alignment when:

  • You need a single, balanced metric for summary quality
  • Both completeness and accuracy are equally important
  • Evaluating abstractive summarization models
  • Comparing different summarization approaches
  • Ensuring summaries meet quality standards for production use

Custom Instructions

You can provide custom instructions to guide the LLM's evaluation:

custom_instruction = "Focus on technical accuracy and scientific claims when evaluating this medical summary."

metrics = evaluate_summary(
    medical_text,
    medical_summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    custom_instruction=custom_instruction
)

Limitations

  • Requires an LLM provider, which may incur costs
  • Results may vary depending on the LLM model used
  • The harmonic mean (F1) heavily penalizes imbalanced metrics
  • Complex or nuanced claims might be challenging to evaluate
  • Computational cost is higher than non-LLM metrics (runs both coverage and consistency checks)

Related Metrics

  • Factual Consistency: Measures only precision (summary claims supported by source)
  • Coverage: Measures only recall (source claims included in summary)
  • Faithfulness: Measures semantic consistency between summary and source