BERT Score
BERT Score leverages contextual embeddings from pre-trained transformer models (DeBERTa in this implementation) to compute similarity scores between candidate and reference texts. This metric provides a more nuanced evaluation by capturing semantic meaning beyond surface-level lexical matches.
Overview
BERT Score addresses limitations of traditional metrics by:
- Using contextual embeddings instead of exact word matches
- Considering semantic similarity rather than just lexical overlap
- Providing token-level granularity in similarity assessment
- Supporting multiple languages through multilingual BERT models
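Because the score is built on contextual embeddings, paraphrases that share few exact words can still score highly. As an illustration, the widely used open-source bert-score package computes the same precision/recall/F1 triple; the sketch below uses that package directly and only illustrates the metric itself, not how this library invokes it internally (the example sentences are made up):

# Hedged sketch using the open-source bert-score package (pip install bert-score).
# This library may wrap the computation differently; the snippet only shows the metric.
from bert_score import score

candidates = ["A feline was resting on the rug."]
references = ["The cat slept on the mat."]

# Passing only lang="en" would select a recommended default model;
# model_type pins the same DeBERTa checkpoint this library defaults to.
P, R, F1 = score(
    candidates,
    references,
    model_type="microsoft/deberta-xlarge-mnli",
    lang="en",
)
print(f"Precision: {P.mean().item():.4f}")
print(f"Recall:    {R.mean().item():.4f}")
print(f"F1:        {F1.mean().item():.4f}")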
Components
BERT Score provides three main evaluation components:
- Precision: Measures how well the candidate text tokens align with reference text tokens
- Recall: Measures how well the reference text tokens are covered by candidate text tokens
- F1: Harmonic mean of precision and recall, providing a balanced score
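Under the hood, these three numbers come from a greedy token-level matching over contextual embeddings: each candidate token is paired with its most similar reference token for precision, each reference token with its most similar candidate token for recall, and F1 combines the two. The sketch below illustrates that idea with the transformers library; it is a simplification (no IDF weighting, special tokens included, made-up example sentences), not this library's exact internals:

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/deberta-base-mnli"  # same base model offered by this library
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_embeddings(text: str) -> torch.Tensor:
    # Contextual embedding for every token, L2-normalised so dot products are cosine similarities.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return torch.nn.functional.normalize(hidden, dim=-1)

cand = token_embeddings("A feline was resting on the rug.")
ref = token_embeddings("The cat slept on the mat.")

# Pairwise cosine similarity between candidate tokens (rows) and reference tokens (columns).
sim = cand @ ref.T

precision = sim.max(dim=1).values.mean()  # best reference match for each candidate token
recall = sim.max(dim=0).values.mean()     # best candidate match for each reference token
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision.item():.4f} R={recall.item():.4f} F1={f1.item():.4f}")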
Available Models
Our BERT Score implementation supports two DeBERTa models:
microsoft/deberta-base-mnli:
- Smaller, faster model
- Good for general use cases
- Recommended when processing large volumes of text
- Lower memory requirements
- ~500 MB download on first use
microsoft/deberta-xlarge-mnli (default):
- Larger, more powerful model
- Better performance but slower
- Recommended for high-stakes evaluations
- Higher memory requirements
- Default model if none specified
- ~3 GB download on first use
To specify a model:
from assert_llm_tools.core import evaluate_summary

# Using base model
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-base-mnli"
)

# Using xlarge model (default)
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-xlarge-mnli"
)
Usage Example
from assert_llm_tools.core import evaluate_summary

# Metrics to compute and the BERT model to use
requested_metrics = ["bert_score"]
bert_model = "microsoft/deberta-xlarge-mnli"

full_text = "full text"
summary = "summary text"

results = evaluate_summary(
    full_text,
    summary,
    metrics=requested_metrics,
    bert_model=bert_model,
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")
Interpretation
Each component provides different insights:
Precision Score:
- Indicates how much of the candidate text is semantically supported by the reference
- Higher scores mean better token-level matches
- Range: 0 to 1
Recall Score:
- Shows how well the reference content is covered
- Higher scores indicate better content preservation
- Range: 0 to 1
F1 Score:
- Balanced measure combining precision and recall
- Best for overall quality assessment
- Range: 0 to 1
For all metrics:
- Higher scores indicate better semantic similarity
- Scores are typically higher than those from n-gram based metrics
- Context-aware matching captures paraphrases better
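In automated pipelines, the F1 component is often turned into a simple pass/fail gate. The sketch below assumes a hypothetical result key ("bert_score_f1") and an illustrative threshold; inspect the dictionary returned by evaluate_summary (as in the usage example above) to confirm the actual key names before relying on this:

from assert_llm_tools.core import evaluate_summary

F1_THRESHOLD = 0.85  # illustrative cutoff; tune on your own data

full_text = "full text"
summary = "summary text"

# bert_model is omitted, so the default xlarge DeBERTa model is used.
results = evaluate_summary(full_text, summary, metrics=["bert_score"])

# "bert_score_f1" is an assumed key name; print results to see the real keys.
f1 = results.get("bert_score_f1")
if f1 is not None and f1 < F1_THRESHOLD:
    print(f"Summary flagged for review (BERT Score F1 = {f1:.4f})")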
Limitations
- Computationally more intensive than n-gram based metrics
- Results can vary based on BERT model choice
- May be sensitive to sentence structure
- Requires more memory and processing power
- Performance depends on domain similarity to BERT's training data