BERT Score

BERT Score leverages pre-trained BERT contextual embeddings to compute similarity scores between candidate and reference texts. This metric provides a more nuanced evaluation by capturing semantic meaning beyond surface-level lexical matches.

Overview

BERT Score addresses limitations of traditional metrics by:

  • Using contextual embeddings instead of exact word matches
  • Considering semantic similarity rather than just lexical overlap
  • Providing token-level granularity in similarity assessment
  • Supporting multiple languages through multilingual BERT models
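
As a quick illustration of the first two points, a paraphrase that shares almost no words with its reference can still score highly, because matching happens in embedding space rather than on surface tokens. The sketch below uses the standalone bert-score package directly, purely for illustration (assuming it is installed via pip install bert-score); the assert_llm_tools API is shown later on this page.

from bert_score import score

candidate = ["The vehicle halted at the crossing."]
reference = ["The car stopped at the intersection."]

# P, R, F1 are tensors with one score per candidate/reference pair.
# The first call downloads the package's default English model.
P, R, F1 = score(candidate, reference, lang="en")
print(f"Precision={P.item():.4f}  Recall={R.item():.4f}  F1={F1.item():.4f}")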

Components

BERT Score provides three main evaluation components:

  • Precision: Measures how well the candidate text tokens align with reference text tokens
  • Recall: Measures how well the reference text tokens are covered by candidate text tokens
  • F1: Harmonic mean of precision and recall, providing a balanced score
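
Under the hood, all three components come from greedy matching over a matrix of cosine similarities between candidate and reference token embeddings. The following sketch shows the computation with made-up embeddings; a real implementation obtains them from a BERT or DeBERTa forward pass and may add IDF weighting on top.

import numpy as np

# Toy stand-ins for contextual token embeddings, shape (num_tokens, hidden_size).
# In practice these come from a BERT/DeBERTa forward pass.
rng = np.random.default_rng(0)
cand_emb = rng.random((4, 8))   # 4 candidate tokens
ref_emb = rng.random((5, 8))    # 5 reference tokens

# L2-normalize so dot products are cosine similarities.
cand_emb /= np.linalg.norm(cand_emb, axis=1, keepdims=True)
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)

sim = cand_emb @ ref_emb.T      # sim[i, j] = cos(candidate_i, reference_j)

# Greedy matching: precision averages each candidate token's best match,
# recall averages each reference token's best match.
precision = sim.max(axis=1).mean()
recall = sim.max(axis=0).mean()
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")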

Available Models

Our implementation of BERT Score supports two DeBERTa models:

  1. microsoft/deberta-base-mnli:

    • Smaller, faster model
    • Good for general use cases
    • Recommended when processing large volumes of text
    • Lower memory requirements
    • ~500 MB download on first use
  2. microsoft/deberta-xlarge-mnli (default):

    • Larger, more powerful model
    • Better performance but slower
    • Recommended for high-stakes evaluations
    • Higher memory requirements
    • Default model if none specified
    • ~3 GB download on first use

To specify a model:

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Using base model
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-base-mnli",
)

# Using xlarge model (default)
metrics = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-xlarge-mnli",
)
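
Both checkpoints are resolved by name from the Hugging Face Hub, which is what triggers the first-use downloads noted above (an assumption based on the standard behavior of transformers-style model names). If that delay is a problem, for example in a container build, you can pre-fetch and cache the weights ahead of time:

from huggingface_hub import snapshot_download

# Download and cache the model weights up front so the first
# evaluate_summary() call does not block on a multi-gigabyte download.
snapshot_download(repo_id="microsoft/deberta-xlarge-mnli")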

Usage Example

from assert_llm_tools.core import evaluate_summary

full_text = "full text"
summary = "summary text"

# Evaluate with BERT Score using the default (xlarge) model.
results = evaluate_summary(
    full_text,
    summary,
    metrics=["bert_score"],
    bert_model="microsoft/deberta-xlarge-mnli",
)

print("\nEvaluation Metrics:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

Interpretation

Each component provides different insights:

  • Precision Score:

    • Indicates how accurate the generated text is
    • Higher scores mean better token-level matches
    • Range: 0 to 1
  • Recall Score:

    • Shows how well the reference content is covered
    • Higher scores indicate better content preservation
    • Range: 0 to 1
  • F1 Score:

    • Balanced measure combining precision and recall
    • Best for overall quality assessment
    • Range: 0 to 1
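
The harmonic mean penalizes an imbalance between precision and recall more strongly than a simple average would, which is why F1 is the usual headline number. A quick worked example:

precision, recall = 0.90, 0.50

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"F1 (harmonic mean): {f1:.3f}")           # ~0.643
print(f"Arithmetic mean:    {arithmetic_mean:.3f}")  # 0.700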

For all metrics:

  • Higher scores indicate better semantic similarity
  • Scores are typically higher than those produced by n-gram-based metrics
  • Context-aware matching captures paraphrases better
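
One consequence of the second point is that raw values tend to cluster in a narrow band near the top of the range, which can make differences between systems look small. The standalone bert-score package offers baseline rescaling to spread scores out; whether assert_llm_tools exposes this option is not covered here, so the sketch below calls the package directly as an illustration:

from bert_score import score

P, R, F1 = score(
    ["The vehicle halted at the crossing."],
    ["The car stopped at the intersection."],
    lang="en",
    rescale_with_baseline=True,  # subtract a precomputed baseline so scores use more of the 0-1 range
)
print(f"Rescaled F1: {F1.item():.4f}")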

Limitations

  • Computationally more intensive than n-gram based metrics
  • Results can vary based on BERT model choice
  • May be sensitive to sentence structure
  • Requires more memory and processing power
  • Performance depends on domain similarity to BERT's training data