BERT Score

BERT Score leverages pre-trained BERT contextual embeddings to compute similarity scores between candidate and reference texts. This metric provides a more nuanced evaluation by capturing semantic meaning beyond surface-level lexical matches.

Overview

BERT Score addresses limitations of traditional metrics by:

  • Using contextual embeddings instead of exact word matches
  • Considering semantic similarity rather than just lexical overlap
  • Providing token-level granularity in similarity assessment
  • Supporting multiple languages through multilingual BERT models
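
As a quick illustration of the first two points, a paraphrase that shares almost no words with its reference can still score highly, because matching happens in embedding space rather than on surface tokens. The sketch below uses the standalone bert-score package directly, purely for illustration (assuming it is installed via pip install bert-score); the assert_llm_tools API is shown later on this page.

from bert_score import score

candidate = ["The vehicle halted at the crossing."]
reference = ["The car stopped at the intersection."]

# P, R, F1 are tensors with one score per candidate/reference pair.
# The first call downloads the package's default English model.
P, R, F1 = score(candidate, reference, lang="en")
print(f"Precision={P.item():.4f}  Recall={R.item():.4f}  F1={F1.item():.4f}")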

Components

BERT Score provides three main evaluation components:

  • Precision: Measures how well the candidate text tokens align with reference text tokens
  • Recall: Measures how well the reference text tokens are covered by candidate text tokens
  • F1: Harmonic mean of precision and recall, providing a balanced score
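
Under the hood, all three components come from greedy matching over a matrix of cosine similarities between candidate and reference token embeddings. The following sketch shows the computation with made-up embeddings; a real implementation obtains them from a BERT or DeBERTa forward pass and may add IDF weighting on top.

import numpy as np

# Toy stand-ins for contextual token embeddings, shape (num_tokens, hidden_size).
# In practice these come from a BERT/DeBERTa forward pass.
rng = np.random.default_rng(0)
cand_emb = rng.random((4, 8))   # 4 candidate tokens
ref_emb = rng.random((5, 8))    # 5 reference tokens

# L2-normalize so dot products are cosine similarities.
cand_emb /= np.linalg.norm(cand_emb, axis=1, keepdims=True)
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)

sim = cand_emb @ ref_emb.T      # sim[i, j] = cos(candidate_i, reference_j)

# Greedy matching: precision averages each candidate token's best match,
# recall averages each reference token's best match.
precision = sim.max(axis=1).mean()
recall = sim.max(axis=0).mean()
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")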

Available Models

Our implementation of BERT Score supports two DeBERTa models:

  1. microsoft/deberta-base-mnli:

    • Smaller, faster model
    • Good for general use cases
    • Recommended when processing large volumes of text
    • Lower memory requirements
    • ~500 MB download on first use
  2. microsoft/deberta-xlarge-mnli (default):

    • Larger, more powerful model
    • Better performance but slower
    • Recommended for high-stakes evaluations
    • Higher memory requirements
    • Default model if none specified
    • ~3 GB download on first use

To specify a model:

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Using base model
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-base-mnli",
)

# Using xlarge model (default)
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-xlarge-mnli",
)
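
Both checkpoints are resolved by name from the Hugging Face Hub, which is what triggers the first-use downloads noted above (an assumption based on the standard behavior of transformers-style model names). If that delay is a problem, for example in a container build, you can pre-fetch and cache the weights ahead of time:

from huggingface_hub import snapshot_download

# Download and cache the model weights up front so the first
# evaluate_summary() call does not block on a multi-gigabyte download.
snapshot_download(repo_id="microsoft/deberta-xlarge-mnli")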

Usage Example

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Evaluate with BERT Score using the default (xlarge) model.
results = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-xlarge-mnli",
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

Each component provides different insights:

  • Precision Score:

    • Indicates how accurate the generated text is
    • Higher scores mean better token-level matches
    • Range: 0 to 1
  • Recall Score:

    • Shows how well the reference content is covered
    • Higher scores indicate better content preservation
    • Range: 0 to 1
  • F1 Score:

    • Balanced measure combining precision and recall
    • Best for overall quality assessment
    • Range: 0 to 1
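
The harmonic mean penalizes an imbalance between precision and recall more strongly than a simple average would, which is why F1 is the usual headline number. A quick worked example:

precision, recall = 0.90, 0.50

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"F1 (harmonic mean): {f1:.3f}")           # ~0.643
print(f"Arithmetic mean:    {arithmetic_mean:.3f}")  # 0.700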

For all metrics:

  • Higher scores indicate better semantic similarity
  • Scores are typically higher than those produced by n-gram-based metrics
  • Context-aware matching captures paraphrases better
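
One consequence of the second point is that raw values tend to cluster in a narrow band near the top of the range, which can make differences between systems look small. The standalone bert-score package offers baseline rescaling to spread scores out; whether assert_llm_tools exposes this option is not covered here, so the sketch below calls the package directly as an illustration:

from bert_score import score

P, R, F1 = score(
    ["The vehicle halted at the crossing."],
    ["The car stopped at the intersection."],
    lang="en",
    rescale_with_baseline=True,  # subtract a precomputed baseline so scores use more of the 0-1 range
)
print(f"Rescaled F1: {F1.item():.4f}")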

Limitations

  • Computationally more intensive than n-gram based metrics
  • Results can vary based on BERT model choice
  • May be sensitive to sentence structure
  • Requires more memory and processing power
  • Performance depends on domain similarity to BERT's training data