Answer Attribution

Answer Attribution measures how much of a generated answer appears to be derived from the provided context. It combines several approaches, including semantic similarity, n-gram overlap, and LLM-based evaluation, to produce a comprehensive attribution score.

Overview

The answer attribution metric uses three key components to evaluate the relationship between an answer and its source context:

  1. Embedding similarity using sentence transformers
  2. N-gram overlap analysis
  3. LLM-based evaluation of context usage

This combined approach helps identify both semantic and literal relationships between the answer and context.
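For intuition, here is a minimal sketch of how the first two components could be computed. This is an illustration only, not the library's internal implementation; the embedding model name, the whitespace tokenizer, and the n-gram size are all assumptions.

from sentence_transformers import SentenceTransformer, util

def ngram_overlap(answer: str, context: str, n: int = 3) -> float:
    # Fraction of answer n-grams that also occur in the context
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    answer_ngrams = ngrams(answer)
    if not answer_ngrams:
        return 0.0
    return len(answer_ngrams & ngrams(context)) / len(answer_ngrams)

# Any sentence-transformers model works here; this choice is an assumption
model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(answer: str, context: str) -> float:
    # Cosine similarity between the answer and context embeddings
    embeddings = model.encode([answer, context])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

The LLM-based component then judges context usage directly, which is why the metric requires an LLM provider (see Usage below).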

Usage

Here's how to evaluate answer attribution using Assert LLM Tools:

from assert_llm_tools.core import evaluate_rag
from assert_llm_tools.llm.config import LLMConfig

# Configure the LLM provider (choose one)

# Option 1: Amazon Bedrock
llm_config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
)

# Option 2: OpenAI (uncomment to use instead)
# llm_config = LLMConfig(
#     provider="openai",
#     model_id="gpt-4",
#     api_key="your-api-key",
# )

# Example texts
question = "What is the Eiffel Tower?"
context = "The Eiffel Tower was completed in 1889. It stands 324 meters tall and is located in Paris, France."
answer = "The Eiffel Tower, located in Paris, was completed in 1889 and reaches a height of 324 meters."

# Evaluate answer attribution
results = evaluate_rag(
    question,
    answer,
    context,
    llm_config=llm_config,
    metrics=["answer_attribution"],
)

# Print results
print(results)

Interpretation

The answer attribution score ranges from 0 to 1:

  • 1.0: Perfect attribution (answer completely derived from context)
  • 0.5: Partial attribution (answer uses context but includes external information)
  • 0.0: No attribution (answer shows no evidence of using the context)
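In practice you might flag low-attribution answers for review. The snippet below is a sketch that assumes results is a dictionary keyed by metric name and that 0.7 is an acceptable cutoff; verify both the result structure and the threshold against your version of the library and your use case.

score = results.get("answer_attribution")  # assumed result structure
THRESHOLD = 0.7  # illustrative cutoff, tune for your application

if score is not None and score < THRESHOLD:
    print(f"Low attribution ({score:.2f}): answer may rely on information outside the context")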

When to Use

Use answer attribution metrics when:

  • Evaluating RAG (Retrieval-Augmented Generation) systems
  • Detecting potential hallucinations in generated answers
  • Measuring how effectively context is being utilized in responses
  • Assessing the reliability of AI-generated answers
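For example, to screen a RAG system's outputs for potential hallucinations in bulk, you could run the metric over a set of (question, context, answer) triples. The sample data and loop below are illustrative; evaluate_rag is called exactly as in the Usage section.

samples = [
    # (question, context, answer) triples collected from your RAG pipeline
    (
        "What is the Eiffel Tower?",
        "The Eiffel Tower was completed in 1889. It stands 324 meters tall and is located in Paris, France.",
        "The Eiffel Tower, located in Paris, was completed in 1889 and reaches a height of 324 meters.",
    ),
]

for question, context, answer in samples:
    results = evaluate_rag(
        question,
        answer,
        context,
        llm_config=llm_config,
        metrics=["answer_attribution"],
    )
    print(question, results)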

Limitations

  • Requires an LLM provider for part of the evaluation, which may incur costs
  • Results can vary based on the LLM model used
  • May not fully capture complex reasoning or implicit connections
  • The embedding and n-gram components capture surface-level similarity, so high overlap does not guarantee factual consistency with the context