Answer Relevance
Answer Relevance measures how well a generated answer addresses the original question. It evaluates the semantic relationship between the question and answer, considering both topical alignment and the completeness of the response.
Overview
The answer relevance metric uses LLM-based evaluation to assess how well an answer addresses the given question (a simplified sketch of this LLM-as-judge approach appears after the list below). The evaluation considers:
- Direct relevance to the question
- Completeness of the response
- Presence of irrelevant information
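To make the scoring approach concrete, the sketch below shows the general LLM-as-judge pattern that answer relevance metrics of this kind rely on. It is an illustration only: the prompt wording, score parsing, and helper names are assumptions and do not reflect Assert LLM Tools' internal implementation.

# Illustrative sketch only: NOT the library's internal prompt, just the
# general LLM-as-judge pattern used for answer relevance scoring.
JUDGE_PROMPT_TEMPLATE = """You are grading how well an answer addresses a question.
Question: {question}
Answer: {answer}

Rate the answer's relevance from 0.0 (completely irrelevant) to 1.0
(directly and completely addresses the question). Penalize missing key
points and irrelevant information. Reply with only the number."""

def build_relevance_prompt(question: str, answer: str) -> str:
    # Fill the template with the question-answer pair to be judged.
    return JUDGE_PROMPT_TEMPLATE.format(question=question, answer=answer)

def parse_relevance_score(llm_reply: str) -> float:
    # Clamp whatever number the judge model returns into the [0, 1] range.
    return max(0.0, min(1.0, float(llm_reply.strip())))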
Usage
Here's how to evaluate answer relevance using Assert LLM Tools:
from assert_llm_tools.core import evaluate_rag
from assert_llm_tools.llm.config import LLMConfig
# Configure LLM provider (choose one)
llm_config = LLMConfig(
    provider="bedrock",  # Default provider
    model_id="anthropic.claude-v2",  # Default model
    region="us-east-1"
)
# Or use OpenAI
llm_config = LLMConfig(
    provider="openai",
    model_id="gpt-4",
    api_key="your-api-key"
)
# Example evaluation
question = "What are the health benefits of green tea?"
answer = "Green tea contains antioxidants called catechins that boost immunity and may help prevent cancer."
context = ["Green tea contains antioxidants called catechins that boost immunity and may help prevent cancer.", "Green tea is a type of tea made from the leaves of the Camellia sinensis plant."]
# Evaluate answer relevance
results = evaluate_rag(
    question,
    answer,
    context,
    llm_config=llm_config,
    metrics=["answer_relevance"]
)
# Print results
print(results) # Returns {"answer_relevance": score}
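Since the result is a plain dictionary, the score can be used directly in a quality gate. The 0.7 threshold below is an arbitrary example, not a library default:

# Continuing from the example above; 0.7 is an illustrative cutoff.
relevance = results["answer_relevance"]
if relevance >= 0.7:
    print(f"PASS: answer relevance {relevance:.2f}")
else:
    print(f"FAIL: answer relevance {relevance:.2f} is below the 0.7 threshold")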
Interpretation
The answer relevance score is a continuous value between 0 and 1. The anchors below describe representative points (a small helper for bucketing scores into labels is sketched after the list):
- 0.0: Completely irrelevant - The answer has no connection to the question
- 0.5: Partially relevant - The answer addresses some aspects but misses key points or includes irrelevant information
- 1.0: Highly relevant - The answer directly addresses the question
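If you prefer a qualitative label over a raw number, a small helper like the one below can bucket scores. The band boundaries mirror the anchors above but are illustrative choices, not thresholds defined by the library:

def relevance_label(score: float) -> str:
    # Illustrative bands only; the library does not define these cutoffs.
    if score >= 0.8:
        return "highly relevant"
    if score >= 0.4:
        return "partially relevant"
    return "irrelevant"

print(relevance_label(0.92))  # highly relevant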
When to Use
Use answer relevance metrics when:
- Evaluating question-answering systems
- Assessing chatbot response quality
- Measuring the effectiveness of RAG systems (a batch evaluation sketch follows this list)
- Testing model comprehension and response accuracy
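For example, to measure a RAG system across a test set, you can loop over question-answer-context triples and average the scores. This sketch reuses evaluate_rag and llm_config from the Usage example above; the test_set structure is an assumption for illustration:

# Sketch: scoring a small test set and averaging answer relevance.
test_set = [
    {
        "question": "What are the health benefits of green tea?",
        "answer": "Green tea contains antioxidants called catechins that boost immunity.",
        "context": ["Green tea contains antioxidants called catechins that boost immunity."],
    },
    {
        "question": "Where does green tea come from?",
        "answer": "Green tea is made from the leaves of the Camellia sinensis plant.",
        "context": ["Green tea is a type of tea made from the leaves of the Camellia sinensis plant."],
    },
]

scores = []
for case in test_set:
    result = evaluate_rag(
        case["question"],
        case["answer"],
        case["context"],
        llm_config=llm_config,
        metrics=["answer_relevance"],
    )
    scores.append(result["answer_relevance"])

print(f"Mean answer relevance: {sum(scores) / len(scores):.2f}")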
Limitations
- Relies on LLM judgment which may have inherent biases
- Currently supports only Bedrock (Claude) and OpenAI providers
- Evaluation is based only on the direct question-answer relationship; retrieved context is not factored into this metric
- Results may vary slightly between different LLM providers or models (a quick cross-provider check is sketched below)
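To gauge how much provider choice matters for your data, you can run the same case under both supported providers and compare the scores. This sketch reuses question, answer, and context from the Usage example above:

# Sketch: comparing answer relevance scores across the two supported providers.
configs = {
    "bedrock": LLMConfig(provider="bedrock", model_id="anthropic.claude-v2", region="us-east-1"),
    "openai": LLMConfig(provider="openai", model_id="gpt-4", api_key="your-api-key"),
}

for name, config in configs.items():
    result = evaluate_rag(
        question,
        answer,
        context,
        llm_config=config,
        metrics=["answer_relevance"],
    )
    print(f"{name}: answer_relevance = {result['answer_relevance']:.2f}")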