Answer Relevance
Answer Relevance measures how well a generated answer addresses the original question. It evaluates the semantic relationship between the question and answer, considering both topical alignment and the completeness of the response.
Overview
The answer relevance metric uses LLM-based evaluation to assess how well an answer addresses the given question (a simplified sketch of this LLM-as-judge approach appears after the list below). The evaluation considers:
- Direct relevance to the question
- Completeness of the response
- Presence of irrelevant information
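To make the scoring approach concrete, the sketch below shows the general LLM-as-judge pattern that answer relevance metrics of this kind rely on. It is an illustration only: the prompt wording, score parsing, and helper names are assumptions and do not reflect Assert LLM Tools' internal implementation.

# Illustrative sketch only: NOT the library's internal prompt, just the
# general LLM-as-judge pattern used for answer relevance scoring.
JUDGE_PROMPT_TEMPLATE = """You are grading how well an answer addresses a question.
Question: {question}
Answer: {answer}

Rate the answer's relevance from 0.0 (completely irrelevant) to 1.0
(directly and completely addresses the question). Penalize missing key
points and irrelevant information. Reply with only the number."""

def build_relevance_prompt(question: str, answer: str) -> str:
    # Fill the template with the question-answer pair to be judged.
    return JUDGE_PROMPT_TEMPLATE.format(question=question, answer=answer)

def parse_relevance_score(llm_reply: str) -> float:
    # Clamp whatever number the judge model returns into the [0, 1] range.
    return max(0.0, min(1.0, float(llm_reply.strip())))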
Usage
Here's how to evaluate answer relevance using Assert LLM Tools:
from assert_llm_tools.core import evaluate_rag
from assert_llm_tools.llm.config import LLMConfig
# Configure LLM provider (choose one)
llm_config = LLMConfig(
    provider="bedrock",  # Default provider
    model_id="anthropic.claude-v2",  # Default model
    region="us-east-1"
)
# Or use OpenAI
llm_config = LLMConfig(
    provider="openai",
    model_id="gpt-4",
    api_key="your-api-key"
)
# Example evaluation
question = "What are the health benefits of green tea?"
answer = "Green tea contains antioxidants called catechins that boost immunity and may help prevent cancer."
context = ["Green tea contains antioxidants called catechins that boost immunity and may help prevent cancer.", "Green tea is a type of tea made from the leaves of the Camellia sinensis plant."]
# Evaluate answer relevance
results = evaluate_rag(
    question,
    answer,
    context,
    llm_config=llm_config,
    metrics=["answer_relevance"]
)
# Print results
print(results) # Returns {"answer_relevance": score}
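Since the result is a plain dictionary, the score can be used directly in a quality gate. The 0.7 threshold below is an arbitrary example, not a library default:

# Continuing from the example above; 0.7 is an illustrative cutoff.
relevance = results["answer_relevance"]
if relevance >= 0.7:
    print(f"PASS: answer relevance {relevance:.2f}")
else:
    print(f"FAIL: answer relevance {relevance:.2f} is below the 0.7 threshold")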
Interpretation
The answer relevance score is a continuous value between 0 and 1. The anchors below describe representative points (a small helper for bucketing scores into labels is sketched after the list):
- 0.0: Completely irrelevant - The answer has no connection to the question
- 0.5: Partially relevant - The answer addresses some aspects but misses key points or includes irrelevant information
- 1.0: Highly relevant - The answer directly addresses the question
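If you prefer a qualitative label over a raw number, a small helper like the one below can bucket scores. The band boundaries mirror the anchors above but are illustrative choices, not thresholds defined by the library:

def relevance_label(score: float) -> str:
    # Illustrative bands only; the library does not define these cutoffs.
    if score >= 0.8:
        return "highly relevant"
    if score >= 0.4:
        return "partially relevant"
    return "irrelevant"

print(relevance_label(0.92))  # highly relevant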
When to Use
Use answer relevance metrics when:
- Evaluating question-answering systems
- Assessing chatbot response quality
- Measuring the effectiveness of RAG systems (a batch evaluation sketch follows this list)
- Testing model comprehension and response accuracy
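For example, to measure a RAG system across a test set, you can loop over question-answer-context triples and average the scores. This sketch reuses evaluate_rag and llm_config from the Usage example above; the test_set structure is an assumption for illustration:

# Sketch: scoring a small test set and averaging answer relevance.
test_set = [
    {
        "question": "What are the health benefits of green tea?",
        "answer": "Green tea contains antioxidants called catechins that boost immunity.",
        "context": ["Green tea contains antioxidants called catechins that boost immunity."],
    },
    {
        "question": "Where does green tea come from?",
        "answer": "Green tea is made from the leaves of the Camellia sinensis plant.",
        "context": ["Green tea is a type of tea made from the leaves of the Camellia sinensis plant."],
    },
]

scores = []
for case in test_set:
    result = evaluate_rag(
        case["question"],
        case["answer"],
        case["context"],
        llm_config=llm_config,
        metrics=["answer_relevance"],
    )
    scores.append(result["answer_relevance"])

print(f"Mean answer relevance: {sum(scores) / len(scores):.2f}")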
Limitations
- Relies on LLM judgment which may have inherent biases
- Currently supports only Bedrock (Claude) and OpenAI providers
- Evaluation is based only on the direct question-answer relationship; retrieved context is not factored into this metric
- Results may vary slightly between different LLM providers or models (a quick cross-provider check is sketched below)
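To gauge how much provider choice matters for your data, you can run the same case under both supported providers and compare the scores. This sketch reuses question, answer, and context from the Usage example above:

# Sketch: comparing answer relevance scores across the two supported providers.
configs = {
    "bedrock": LLMConfig(provider="bedrock", model_id="anthropic.claude-v2", region="us-east-1"),
    "openai": LLMConfig(provider="openai", model_id="gpt-4", api_key="your-api-key"),
}

for name, config in configs.items():
    result = evaluate_rag(
        question,
        answer,
        context,
        llm_config=config,
        metrics=["answer_relevance"],
    )
    print(f"{name}: answer_relevance = {result['answer_relevance']:.2f}")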