ASSERT LLM TOOLS

Automated Summary Scoring & Evaluation of Retained Text

A comprehensive toolkit for evaluating the quality of summaries generated by Large Language Models (LLMs).

Quick Start

# Basic installation
pip install assert_llm_tools

# With provider-specific features
pip install "assert_llm_tools[bedrock]" # For Amazon Bedrock
pip install "assert_llm_tools[openai]" # For OpenAI
pip install "assert_llm_tools[all]" # All features
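
To verify the installation, import the package. This is just a quick sanity check; it assumes nothing beyond the package name used above:

# Quick sanity check that the package is importable.
import assert_llm_tools
print("assert_llm_tools imported successfully")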

Basic usage:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

# Configure your LLM provider
config = LLMConfig(
    provider="openai",
    model_id="gpt-4",
    api_key="your-api-key"
)

# Evaluate a summary
metrics = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["rouge", "bleu", "bert_score"],
    llm_config=config
)
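
The exact structure of the return value isn't shown on this page; assuming evaluate_summary returns a dict-like mapping of metric names to scores, it can be inspected like this:

# Print each requested metric and its score (assumes a dict-like return value).
for name, score in metrics.items():
    print(f"{name}: {score}")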

Available Metrics

Summarisation Non-LLM Metrics

  • ROUGE Score (0-1): Measures overlap of n-grams between the reference text and generated summary
  • BLEU Score (0-1): Measures n-gram precision between the generated summary and the reference text (BLEU was originally designed for machine translation), with custom weights emphasizing unigrams and bigrams
  • BERT Score (0-1): Leverages contextual embeddings to better capture semantic similarity
  • BART Score (≤0): Uses BART's sequence-to-sequence model to evaluate semantic similarity and generation quality
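
A minimal sketch requesting only these reference-based metrics. It assumes they can be computed without an LLM configuration, and that the BART Score key follows the same naming pattern as the others:

from assert_llm_tools.core import evaluate_summary

# Reference-based metrics only; no llm_config is passed, on the assumption
# that these metrics run locally. The "bart_score" key is an assumed name.
scores = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["rouge", "bleu", "bert_score", "bart_score"]
)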

Summarisation LLM-Based Metrics

  • Faithfulness (0-1): Measures factual consistency between summary and source text
  • Topic Preservation (0-1): Verifies that the most important topics from the source are retained in the summary
  • Redundancy Detection (0-1): Identifies and flags repeated information within summaries
  • Conciseness Assessment (0-1): Evaluates if the summary effectively condenses information without unnecessary verbosity
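
These metrics call the configured LLM, so an LLMConfig is required (see Provider Configuration below). A sketch follows; apart from "faithfulness", which appears elsewhere on this page, the metric keys are assumed names based on the list above:

from assert_llm_tools.core import evaluate_summary
from assert_llm_tools.llm.config import LLMConfig

config = LLMConfig(provider="openai", model_id="gpt-4", api_key="your-api-key")

# "faithfulness" is shown elsewhere in these docs; the other keys are assumptions.
scores = evaluate_summary(
    full_text="Your source text here...",
    summary="Your summary here...",
    metrics=["faithfulness", "topic_preservation", "redundancy", "conciseness"],
    llm_config=config
)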

RAG Evaluation Metrics

  • Context Relevance (0-1): Evaluates how well the retrieved context matches the query
  • Answer Accuracy (0-1): Measures the factual correctness of the generated answer based on the provided context
  • Context Utilization (0-1): Assesses how effectively the model uses the provided context in generating the answer
  • Completeness (0-1): Evaluates whether the answer addresses all aspects of the query
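
This page does not show the RAG evaluation entry point, so the sketch below is purely illustrative: the evaluate_rag name and its parameters are assumptions, not the package's documented API.

# Hypothetical call -- evaluate_rag and its parameters are assumed names,
# shown only to illustrate the inputs a RAG evaluation would need.
results = evaluate_rag(
    question="Your user query here...",
    context="The retrieved passages here...",
    answer="The generated answer here...",
    metrics=["context_relevance", "answer_accuracy", "context_utilization", "completeness"],
    llm_config=config
)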

Key Features

  • 🎯 Metric Selection: Choose specific metrics to evaluate
  • 🔍 Stopword Handling: Custom stopword filtering
  • 🤖 Multiple LLM Providers: Support for OpenAI and Amazon Bedrock
  • 📊 Progress Tracking: Visual progress during evaluation
  • 📈 Normalized Scores: All metrics scaled to 0-1 range (except BART Score)

Advanced Usage

Custom Stopwords

from assert_llm_tools.utils import add_custom_stopwords

add_custom_stopwords(["custom", "stopwords", "here"])
evaluate_summary(full_text, summary, remove_stopwords=True)

Metric Selection

metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["rouge", "bleu", "faithfulness"],
    llm_config=config  # "faithfulness" is LLM-based, so a provider config is required
)

Provider Configuration

config = LLMConfig(
    provider="bedrock",
    model_id="anthropic.claude-v2",
    region="us-east-1",
    api_key="your-api-key",
    api_secret="your-api-secret"
)

Important Notes

  • BERT Score requires model weights download on first use (~500MB for base model)
  • BART Score uses BART-large-CNN model (~1.6GB download on first use)
  • LLM-based metrics require appropriate provider configuration
  • All scores are normalized except BART Score (which is log-likelihood based)
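
As noted above, BART Score is log-likelihood based: it scores a summary by the (average) log-probability that BART assigns to the summary tokens given the source, which is why values are at or below zero and scores closer to zero are better. In the usual BARTScore formulation:

\mathrm{BARTScore}(x \to y) = \frac{1}{m} \sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x)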
