ASSERT LLM TOOLS
Automated Summary Scoring & Evaluation of Retained Text
A comprehensive toolkit for evaluating the quality of summaries generated by Large Language Models (LLMs).
Quick Start
# Basic installation
pip install assert_llm_tools
# With provider-specific features
pip install "assert_llm_tools[bedrock]" # For Amazon Bedrock
pip install "assert_llm_tools[openai]" # For OpenAI
pip install "assert_llm_tools[all]" # All features
Basic usage:
from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig
# Configure your LLM provider
config = LLMConfig(
    provider="openai",
    model_id="gpt-4",
    api_key="your-api-key"
)
# Evaluate a summary
metrics = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["rouge", "bleu", "bert_score"],
    llm_config=config
)
Available Metrics
Summarisation Non-LLM Metrics
- ROUGE Score (0-1): Measures n-gram overlap between the reference text and the generated summary (see the sketch after this list)
- BLEU Score (0-1): Adapts the machine-translation BLEU metric to summaries, comparing n-gram matches with custom weights emphasizing unigrams and bigrams
- BERT Score (0-1): Leverages contextual embeddings to better capture semantic similarity
- BART Score (≤0): Uses BART's sequence-to-sequence model to evaluate semantic similarity and generation quality
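For intuition, the sketch below shows the clipped unigram-overlap idea behind ROUGE-1 in plain Python. It is an illustration only, not this library's implementation.
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    # Clipped unigram overlap between reference and candidate tokens
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.83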
Summarisation LLM-Based Metrics
- Faithfulness (0-1): Measures factual consistency between summary and source text
- Topic Preservation (0-1): Verifies that the most important topics from the source are retained in the summary
- Redundancy Detection (0-1): Identifies and flags repeated information within summaries
- Conciseness Assessment (0-1): Evaluates if the summary effectively condenses information without unnecessary verbosity
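A hedged example of requesting these LLM-based metrics: "faithfulness" matches the identifier used later in this README, while the other metric names below are assumptions and may differ from the actual identifiers.
# "faithfulness" appears elsewhere in this README; the other metric
# identifiers are assumptions and may need adjusting.
metrics = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["faithfulness", "topic_preservation", "redundancy", "conciseness"],
    llm_config=config  # LLM-based metrics require a configured provider
)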
RAG Evaluation Metrics
- Context Relevance (0-1): Evaluates how well the retrieved context matches the query
- Answer Accuracy (0-1): Measures the factual correctness of the generated answer based on the provided context
- Context Utilization (0-1): Assesses how effectively the model uses the provided context in generating the answer
- Completeness (0-1): Evaluates whether the answer addresses all aspects of the query
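This README does not show the RAG evaluation entry point, so the call below is only a sketch: the function name, parameters, and metric identifiers are assumptions, not a documented API.
# Hypothetical RAG evaluation call -- function name, parameters and metric
# identifiers are placeholders, not a documented API.
rag_metrics = evaluate_rag(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France.",
    answer="The capital of France is Paris.",
    metrics=["context_relevance", "answer_accuracy"],
    llm_config=config
)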
Key Features
- 🎯 Metric Selection: Choose specific metrics to evaluate
- 🔍 Stopword Handling: Custom stopword filtering
- 🤖 Multiple LLM Providers: Support for OpenAI and Amazon Bedrock
- 📊 Progress Tracking: Visual progress during evaluation
- 📈 Normalized Scores: All metrics scaled to 0-1 range (except BART Score)
Advanced Usage
Custom Stopwords
from assert_llm_tools.utils import add_custom_stopwords
add_custom_stopwords(["custom", "stopwords", "here"])
evaluate_summary(full_text, summary, remove_stopwords=True)
Metric Selection
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["rouge", "bleu", "faithfulness"],
    llm_config=config  # needed here because "faithfulness" is LLM-based
)
Provider Configuration
config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
    api_key="your-api-key",
    api_secret="your-api-secret"
)
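The resulting config is passed to evaluate_summary in the same way as the OpenAI example above:
metrics = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["faithfulness"],
    llm_config=config
)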
Important Notes
- BERT Score requires model weights download on first use (~500MB for base model)
- BART Score uses BART-large-CNN model (~1.6GB download on first use)
- LLM-based metrics require appropriate provider configuration
- All scores are normalized except BART Score (which is log-likelihood based)