Factual Alignment
Factual Alignment provides a balanced measure of summary quality by computing an F1 score that combines coverage (recall) and factual consistency (precision).
Overview
This metric addresses a key challenge in summarization evaluation: balancing completeness and accuracy. A good summary should both capture the important information from the source (coverage) and ensure that everything it states is factually supported (consistency). The F1 score provides a single metric that achieves this balance.
How It Works
Factual Alignment combines two complementary metrics:
- Coverage (Recall): What percentage of source claims appear in the summary
- Factual Consistency (Precision): What percentage of summary claims are supported by the source
The F1 score is the harmonic mean of these two metrics:
F1 = 2 × (precision × recall) / (precision + recall)
This is particularly useful when you want to ensure summaries are both comprehensive and factually grounded, preventing both information loss and hallucinations.
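The harmonic-mean combination above can be sketched in a few lines. This is a minimal illustration of the formula itself, not the library's internal implementation:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision (factual consistency) and recall (coverage)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# A summary with 0.9 factual consistency that covers 0.6 of source claims:
print(f"{f1_score(0.9, 0.6):.4f}")  # 0.7200
```

Because the harmonic mean is dominated by the smaller of the two inputs, a summary cannot score well by excelling at only one dimension.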
Usage
Here's how to evaluate factual alignment using Assert LLM Tools:
from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig
# Configure an LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1"
)

# Option 2: OpenAI
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key"
# )
# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds and occasionally naps in the garden."
summary = "The black cat sleeps by the window when it's sunny."
# Evaluate factual alignment
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config
)
# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")
Verbose Mode
For detailed claim-level analysis, use the verbose parameter:
# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    verbose=True
)
# Access detailed results
print(f"Factual Alignment: {metrics['factual_alignment']:.4f}")
print(f"Coverage: {metrics['coverage']:.4f}")
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nReference claims: {metrics['reference_claims_count']}")
print(f"Summary claims: {metrics['summary_claims_count']}")
print(f"Claims in summary: {metrics['claims_in_summary_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")
# Detailed claim-level analysis
if 'coverage_claims_analysis' in metrics:
    print("\nCoverage Analysis:")
    print(metrics['coverage_claims_analysis'])

if 'consistency_claims_analysis' in metrics:
    print("\nConsistency Analysis:")
    print(metrics['consistency_claims_analysis'])
Return Values
The metric returns a dictionary containing:
- factual_alignment: F1 score combining coverage and factual_consistency (0-1)
- coverage: Recall score (how much of the source is in the summary)
- factual_consistency: Precision score (how much of the summary is supported)
- reference_claims_count: Total claims in the reference
- summary_claims_count: Total claims in the summary
- claims_in_summary_count: Source claims found in the summary
- supported_claims_count: Summary claims supported by the source
- unsupported_claims_count: Summary claims not supported by the source
- coverage_claims_analysis (only if verbose=True): Detailed coverage claim analysis
- consistency_claims_analysis (only if verbose=True): Detailed consistency claim analysis
Interpretation
The factual alignment score ranges from 0 to 1:
- 1.0: Perfect balance - the summary comprehensively covers source claims while being fully supported
- 0.8-0.99: Excellent - minor gaps in coverage or support
- 0.6-0.79: Good - reasonable balance but room for improvement
- 0.4-0.59: Fair - significant issues with either coverage or accuracy
- Below 0.4: Poor - major problems with completeness or factual grounding
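The bands above can be expressed as a small helper for reporting. The thresholds mirror the table; the function name is illustrative and not part of the library's API:

```python
def interpret_alignment(score: float) -> str:
    """Map a factual alignment score (0-1) to a qualitative band."""
    if score >= 1.0:
        return "Perfect"
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Fair"
    return "Poor"

print(interpret_alignment(0.72))  # Good
```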
Understanding the Components
- High Coverage, Low Consistency: Summary includes most source information but adds unsupported claims (hallucinations)
- High Consistency, Low Coverage: Summary is accurate but misses important information (incomplete)
- High Factual Alignment: Both metrics are high, indicating a well-balanced summary
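A quick calculation shows how the harmonic mean punishes the imbalanced cases above. The numbers here are illustrative, not drawn from a real evaluation:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Accurate but incomplete: high consistency (0.95), low coverage (0.40)
print(round(f1(0.95, 0.40), 2))  # 0.56

# A balanced summary scoring 0.675 on both components would reach
# a factual alignment of 0.675, despite the same arithmetic mean.
```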
When to Use
Use factual alignment when:
- You need a single, balanced metric for summary quality
- Both completeness and accuracy are equally important
- Evaluating abstractive summarization models
- Comparing different summarization approaches
- Ensuring summaries meet quality standards for production use
Custom Instructions
You can provide custom instructions to guide the LLM's evaluation:
custom_instruction = "Focus on technical accuracy and scientific claims when evaluating this medical summary."
metrics = evaluate_summary(
    medical_text,
    medical_summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    custom_instruction=custom_instruction
)
Limitations
- Requires an LLM provider, which may incur costs
- Results may vary depending on the LLM model used
- The harmonic mean (F1) heavily penalizes imbalanced metrics
- Complex or nuanced claims might be challenging to evaluate
- Computational cost is higher than non-LLM metrics (runs both coverage and consistency checks)
Related Metrics
- Factual Consistency: Measures only precision (summary claims supported by source)
- Coverage: Measures only recall (source claims included in summary)
- Faithfulness: Semantic consistency between summary and source