Factual Alignment

Factual Alignment provides a balanced measure of summary quality by calculating the F1 score that combines coverage (recall) and factual consistency (precision).

Overview

This metric addresses a key challenge in summarization evaluation: balancing completeness and accuracy. A good summary should both capture the important information from the source (coverage) and ensure that everything it states is factually supported (consistency). The F1 score provides a single metric that achieves this balance.

How It Works

Factual Alignment combines two complementary metrics:

  1. Coverage (Recall): the percentage of source claims that appear in the summary
  2. Factual Consistency (Precision): the percentage of summary claims that are supported by the source

The F1 score is the harmonic mean of these two metrics:

F1 = 2 × (precision × recall) / (precision + recall)

This is particularly useful when you want to ensure summaries are both comprehensive and factually grounded, preventing both information loss and hallucinations.
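As a quick sanity check, the formula can be computed directly in plain Python, independent of the library:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# A summary with 90% of its claims supported (precision) that covers
# 80% of the source claims (recall):
print(f"{f1_score(precision=0.9, recall=0.8):.4f}")  # 0.8471
```

Because the harmonic mean is dominated by the smaller component, a summary cannot score well by excelling at only one of coverage or consistency.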

Usage

Here's how to evaluate factual alignment using Assert LLM Tools:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure an LLM provider (choose one)
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1"
)

# Alternatively, use OpenAI:
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4o-mini",
#     api_key="your-api-key"
# )

# Example texts
full_text = "The cat is black and sleeps on the windowsill during sunny afternoons. It enjoys watching birds and occasionally naps in the garden."
summary = "The black cat sleeps by the window when it's sunny."

# Evaluate factual alignment
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config
)

# Print results
print("\nEvaluation Metrics:")
for metric, score in metrics.items():
    print(f"{metric}: {score:.4f}")

Verbose Mode

For detailed claim-level analysis, use the verbose parameter:

# Get detailed claim analysis
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    verbose=True
)

# Access detailed results
print(f"Factual Alignment: {metrics['factual_alignment']:.4f}")
print(f"Coverage: {metrics['coverage']:.4f}")
print(f"Factual Consistency: {metrics['factual_consistency']:.4f}")
print(f"\nReference claims: {metrics['reference_claims_count']}")
print(f"Summary claims: {metrics['summary_claims_count']}")
print(f"Claims in summary: {metrics['claims_in_summary_count']}")
print(f"Supported claims: {metrics['supported_claims_count']}")
print(f"Unsupported claims: {metrics['unsupported_claims_count']}")

# Detailed claim-level analysis
if 'coverage_claims_analysis' in metrics:
    print("\nCoverage Analysis:")
    print(metrics['coverage_claims_analysis'])

if 'consistency_claims_analysis' in metrics:
    print("\nConsistency Analysis:")
    print(metrics['consistency_claims_analysis'])

Return Values

The metric returns a dictionary containing:

  • factual_alignment: F1 score combining coverage and factual_consistency (0-1)
  • coverage: Recall score (how much of source is in summary)
  • factual_consistency: Precision score (how much of summary is supported)
  • reference_claims_count: Total claims in reference
  • summary_claims_count: Total claims in summary
  • claims_in_summary_count: Source claims found in summary
  • supported_claims_count: Summary claims supported by source
  • unsupported_claims_count: Summary claims not supported by source
  • coverage_claims_analysis (only if verbose=True): Detailed coverage claim analysis
  • consistency_claims_analysis (only if verbose=True): Detailed consistency claim analysis
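Assuming the component scores are derived from these counts in the straightforward way (an illustration of the relationship, not necessarily the library's exact internals), the values fit together like this:

```python
# Hypothetical counts, shaped like the returned dictionary
metrics = {
    "reference_claims_count": 10,
    "summary_claims_count": 6,
    "claims_in_summary_count": 7,
    "supported_claims_count": 5,
}

coverage = metrics["claims_in_summary_count"] / metrics["reference_claims_count"]    # recall
consistency = metrics["supported_claims_count"] / metrics["summary_claims_count"]    # precision
alignment = 2 * (consistency * coverage) / (consistency + coverage)                  # F1

print(f"coverage={coverage:.2f}, consistency={consistency:.2f}, alignment={alignment:.4f}")
```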

Interpretation

The factual alignment score ranges from 0 to 1:

  • 1.0: Perfect balance - the summary comprehensively covers source claims while being fully supported
  • 0.8-0.99: Excellent - minor gaps in coverage or support
  • 0.6-0.79: Good - reasonable balance but room for improvement
  • 0.4-0.59: Fair - significant issues with either coverage or accuracy
  • Below 0.4: Poor - major problems with completeness or factual grounding
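The bands above can be encoded as a small reporting helper; the function itself is illustrative (not part of Assert LLM Tools), with thresholds taken directly from the list:

```python
def interpret_alignment(score: float) -> str:
    """Map a factual alignment score to the qualitative bands above."""
    if score >= 1.0:
        return "Perfect balance"
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Fair"
    return "Poor"

print(interpret_alignment(0.8471))  # Excellent
```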

Understanding the Components

  • High Coverage, Low Consistency: Summary includes most source information but adds unsupported claims (hallucinations)
  • High Consistency, Low Coverage: Summary is accurate but misses important information (incomplete)
  • High Factual Alignment: Both metrics are high, indicating a well-balanced summary
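These three cases can be told apart by inspecting the component scores. The helper below, including its 0.7 threshold, is purely illustrative and not part of Assert LLM Tools:

```python
def diagnose(coverage: float, consistency: float, threshold: float = 0.7) -> str:
    """Rough diagnosis of a summary from its component scores."""
    if coverage >= threshold and consistency < threshold:
        return "likely hallucination: unsupported claims were added"
    if consistency >= threshold and coverage < threshold:
        return "likely incomplete: important source information is missing"
    if coverage >= threshold and consistency >= threshold:
        return "well-balanced summary"
    return "both coverage and consistency need improvement"

print(diagnose(coverage=0.9, consistency=0.5))
```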

When to Use

Use factual alignment when:

  • You need a single, balanced metric for summary quality
  • Both completeness and accuracy are equally important
  • Evaluating abstractive summarization models
  • Comparing different summarization approaches
  • Ensuring summaries meet quality standards for production use

Custom Instructions

You can provide custom instructions to guide the LLM's evaluation:

custom_instruction = "Focus on technical accuracy and scientific claims when evaluating this medical summary."

metrics = evaluate_summary(
    medical_text,
    medical_summary,
    metrics=["factual_alignment"],
    llm_config=llm_config,
    custom_instruction=custom_instruction
)

Limitations

  • Requires an LLM provider, which may incur costs
  • Results may vary depending on the LLM model used
  • The harmonic mean (F1) heavily penalizes imbalanced metrics
  • Complex or nuanced claims might be challenging to evaluate
  • Computational cost is higher than non-LLM metrics (runs both coverage and consistency checks)

Related Metrics

  • Factual Consistency: Measures only precision (summary claims supported by source)
  • Coverage: Measures only recall (source claims included in summary)
  • Faithfulness: Measures semantic consistency between summary and source