
How to Evaluate LLM Limitations with Emily Bender's Framework


BlogIA Academy · July 1, 2026 · 13 min read · 2,564 words



Large language models generate text that appears coherent and knowledgeable, but their outputs often contain subtle errors, biases, and fundamental misunderstandings of meaning. Emily Menon Bender, a professor at the University of Washington, where she directs the Computational Linguistics Laboratory, has spent years clarifying these limitations through her research on computational linguistics and natural language processing [1]. Her framework for evaluating what LLMs actually do, as opposed to what they appear to do, gives engineers concrete criteria for building more reliable systems.

This tutorial walks through implementing a production-grade evaluation pipeline that applies Bender's linguistic analysis principles to detect common LLM failure modes. You'll build a system that identifies when a model is generating text without genuine understanding, flags potential hallucination risks, and quantifies the gap between surface fluency and semantic grounding.

Why Bender's Framework Matters in Production

Most LLM evaluation tools focus on surface metrics like BLEU scores or perplexity. These numbers say little about whether the generated content is grounded or correct. Bender and her co-authors describe LLMs as "stochastic parrots": systems that produce plausible text by pattern matching, not by reasoning about meaning [1]. In production, this distinction matters because:

  • A model that scores 0.95 on ROUGE-L can still generate factually incorrect medical advice
  • Low perplexity (high fluency) doesn't guarantee a low hallucination rate in domain-specific tasks
  • Fluency metrics miss semantic contradictions that human readers catch immediately
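
To make the first bullet concrete, here is a minimal sketch of an LCS-based ROUGE-L F1 score written from scratch. It is not the official ROUGE implementation, and the two dosing sentences are invented for illustration. It shows how a tenfold dosing error barely moves an overlap metric.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 over whitespace tokens (illustrative only)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example sentences: the candidate contains a tenfold dosing error.
reference = "the recommended dose is 5 mg once daily"
candidate = "the recommended dose is 50 mg once daily"
score = rouge_l_f1(reference, candidate)
print(f"ROUGE-L F1: {score:.3f}")  # ROUGE-L F1: 0.875
```

Seven of the eight tokens line up, so the score stays high even though the one token that differs is the one that matters.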

The evaluation framework we'll build implements three core principles from Bender's linguistic analysis:

  1. Grounding checks: Does the output reference verifiable entities or invented ones?
  2. Coherence analysis: Does the text maintain logical consistency across sentences?
  3. Intent detection: Is the model generating text that appears intentional but lacks actual communicative goals?

Prerequisites and Environment Setup

You'll need Python 3.10+ and the following packages. Install them in a fresh virtual environment:

python -m venv llm_eval_env
source llm_eval_env/bin/activate
pip install torch==2.1.0 transformers==4.36.0 spacy==3.7.2 nltk==3.8.1 scipy==1.11.4 scikit-learn==1.3.2
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet

The core dependencies break down as:

  • transformers for loading LLMs and tokenizers
  • spacy for dependency parsing and entity recognition
  • nltk for lexical semantics and WordNet-based grounding checks
  • scipy and scikit-learn for statistical analysis of output patterns

Building the Evaluation Pipeline

Step 1: Entity Grounding Verification

Bender's work emphasizes that LLMs generate text without referential grounding—they don't connect words to real-world objects or facts [1]. We'll implement a grounding checker that verifies whether named entities in the output correspond to known entities in a knowledge base.

import spacy
from typing import List, Dict, Set, Tuple
import json
from pathlib import Path

class EntityGroundingChecker:
    """
    Implements Bender's concept of referential grounding.
    Checks whether entities in LLM output have verifiable real-world references.
    """

    def __init__(self, knowledge_base_path: str | None = None):
        self.nlp = spacy.load("en_core_web_sm")
        # Load a minimal knowledge base of verifiable entities
        # In production, this would connect to Wikidata or a domain-specific KB
        self.known_entities = self._load_known_entities(knowledge_base_path)

    def _load_known_entities(self, path: str | None) -> Dict[str, Set[str]]:
        """Load known entities categorized by type."""
        if path and Path(path).exists():
            with open(path, 'r', encoding='utf-8') as f:
                # JSON has no set type, so convert each list of names to a set
                return {label: set(names) for label, names in json.load(f).items()}
        # Fallback to a small built-in set for demonstration
        return {
            "PERSON": {"Einstein", "Newton", "Bender", "Turing"},
            "ORG": {"Google", "Microsoft", "OpenAI", "UW"},
            "GPE": {"Seattle", "London", "Tokyo", "Berlin"},
            "DATE": {"2023", "2024", "2025", "2026"}
        }

    def extract_entities(self, text: str) -> List[Dict]:
        """Extract named entities with their types and spans."""
        doc = self.nlp(text)
        entities = []
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char
            })
        return entities

    def check_grounding(self, text: str) -> Dict:
        """
        Returns grounding score and ungrounded entities.
        A score of 1.0 means all entities are verifiable.
        """
        entities = self.extract_entities(text)
        if not entities:
            return {"score": 1.0, "ungrounded": [], "total_entities": 0}

        ungrounded = []
        for ent in entities:
            entity_type = ent["label"]
            entity_text = ent["text"]

            # Check if entity exists in our knowledge base
            known_set = self.known_entities.get(entity_type, set())
            # Simple substring matching for compound entities
            is_grounded = any(
                known.lower() in entity_text.lower() or 
                entity_text.lower() in known.lower()
                for known in known_set
            )

            if not is_grounded:
                ungrounded.append(ent)

        score = 1.0 - (len(ungrounded) / len(entities))
        return {
            "score": score,
            "ungrounded": ungrounded,
            "total_entities": len(entities)
        }

# Usage example
checker = EntityGroundingChecker()
sample_output = "Dr. Smith from Stanford University published a paper in 2025 about quantum computing."
result = checker.check_grounding(sample_output)
print(f"Grounding score: {result['score']:.2f}")
print(f"Ungrounded entities: {[e['text'] for e in result['ungrounded']]}")

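The grounding checker illustrates both the idea and its limits. Run against the built-in demonstration knowledge base, it will typically flag "Smith" and "Stanford University" as ungrounded, even though Stanford University is a real institution, because neither name appears in the small fallback set. The score therefore reflects knowledge-base coverage as much as fabrication. It also says nothing about whether the paper itself exists: an entity can be real while the claim made about it is invented. That is the gap this tutorial attributes to Bender's analysis: text that is fluent but lacks referential grounding [1].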
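
The substring rule used in check_grounding is permissive in both directions, which can mark unrelated names as grounded. The sketch below compares it with a stricter whole-token rule. The function names substring_match and token_match are hypothetical helpers introduced here, not part of the checker above, and the test entities are made up.

```python
import re


def substring_match(entity: str, known: set[str]) -> bool:
    """The permissive rule: a substring match in either direction counts."""
    e = entity.lower()
    return any(k.lower() in e or e in k.lower() for k in known)


def token_match(entity: str, known: set[str]) -> bool:
    """Stricter variant: a known name must appear as a whole token sequence."""
    e_tokens = re.findall(r"\w+", entity.lower())
    for k in known:
        k_tokens = re.findall(r"\w+", k.lower())
        n = len(k_tokens)
        if n and any(e_tokens[i:i + n] == k_tokens
                     for i in range(len(e_tokens) - n + 1)):
            return True
    return False


known_orgs = {"Google", "Microsoft", "OpenAI", "UW"}
for entity in ["Kuwait Petroleum", "Google Research", "Micro"]:
    print(entity, substring_match(entity, known_orgs), token_match(entity, known_orgs))
```

"Kuwait Petroleum" passes the substring rule only because "uw" occurs inside "kuwait", and "Micro" passes because it is a prefix of "Microsoft". The token rule rejects both while still accepting "Google Research". Whether the stricter rule suits a given domain is a judgment call, since it will also reject legitimate abbreviations.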

Step 2: Semantic Coherence Analysis

LLMs often produce text that contradicts itself across sentences. Bender's linguistic analysis shows that these contradictions stem from the model's lack of a consistent world model [1]. We'll implement a coherence analyzer that tracks semantic consistency.

import nltk
from nltk.corpus import wordnet as wn
from typing import Dict, List, Set
import numpy as np

class SemanticCoherenceAnalyzer:
    """
    Detects semantic contradictions and logical inconsistencies
    in LLM output using WordNet-based semantic relations.
    """

    def __init__(self):
        self.sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        self.pos_tagger = nltk.pos_tag

    def _get_semantic_relations(self, word: str) -> Set[str]:
        """Get lowercase antonyms for a word using WordNet."""
        # Hypernyms are deliberately not collected: "dog" and "animal" are
        # compatible, so a hypernym match must not count as a contradiction.
        relations = set()
        for syn in wn.synsets(word):
            for lemma in syn.lemmas():
                relations.update(ant.name().lower() for ant in lemma.antonyms())
        return relations

    def _extract_claims(self, sentence: str) -> List[Dict]:
        """
        Extract subject-verb-object triples as atomic claims.
        This approximates Bender's concept of propositional content.
        """
        tokens = nltk.word_tokenize(sentence)
        pos_tags = self.pos_tagger(tokens)

        claims = []
        # Simple pattern: find noun-verb-noun sequences
        for i in range(len(pos_tags) - 2):
            if pos_tags[i][1].startswith('NN') and \
               pos_tags[i+1][1].startswith('VB') and \
               pos_tags[i+2][1].startswith('NN'):
                claims.append({
                    "subject": pos_tags[i][0],
                    "verb": pos_tags[i+1][0],
                    "object": pos_tags[i+2][0]
                })
        return claims

    def check_coherence(self, text: str) -> Dict:
        """
        Returns coherence score and detected contradictions.
        Lower scores indicate more contradictions.
        """
        sentences = self.sentence_tokenizer.tokenize(text)
        if len(sentences) < 2:
            return {"score": 1.0, "contradictions": [], "total_claims": 0}

        all_claims = []
        for sent in sentences:
            claims = self._extract_claims(sent)
            all_claims.extend(claims)

        contradictions = []
        # Check for antonym pairs across claims
        for i, claim1 in enumerate(all_claims):
            for claim2 in all_claims[i+1:]:
                # Check if same subject has contradictory objects
                if claim1["subject"].lower() == claim2["subject"].lower():
                    obj1_relations = self._get_semantic_relations(claim1["object"])
                    obj2_relations = self._get_semantic_relations(claim2["object"])

                    # Check for antonym relations (compare case-insensitively)
                    if claim2["object"].lower() in obj1_relations or \
                       claim1["object"].lower() in obj2_relations:
                        contradictions.append({
                            "claim1": claim1,
                            "claim2": claim2,
                            "type": "antonym_contradiction"
                        })

        total_claims = len(all_claims)
        if total_claims == 0:
            return {"score": 1.0, "contradictions": [], "total_claims": 0}

        score = 1.0 - (len(contradictions) / total_claims)
        return {
            "score": score,
            "contradictions": contradictions,
            "total_claims": total_claims
        }

# Example showing contradiction detection
analyzer = SemanticCoherenceAnalyzer()
contradictory_text = """
The Earth is flat and supported by pillars. 
The Earth is a sphere orbiting the sun.
Scientists agree the Earth is flat.
"""
result = analyzer.check_coherence(contradictory_text)
print(f"Coherence score: {result['score']:.2f}")
print(f"Contradictions found: {len(result['contradictions'])}")

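This analyzer targets a pattern Bender frequently highlights: models can assert contradictory facts within the same generation because they lack a consistent internal representation of truth [1]. It is a narrow heuristic, though. It extracts only adjacent noun-verb-noun triples and compares their objects through WordNet lookups, so the flat-Earth example above, whose predicates are an adjective ("is flat") and a determiner phrase ("is a sphere"), will likely yield no claims and a score of 1.0. Treat a low score as a hint worth reviewing, and treat a high score as inconclusive, not as evidence that the text is consistent.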
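
Antonym lookup also misses the most common form of self-contradiction, plain negation. The following sketch tracks polarity for simple copular sentences with a regular expression. It is an illustration under strong assumptions (one clause per sentence, "is"/"are" only), and find_polarity_conflicts is a hypothetical helper, not part of the analyzer above.

```python
import re
from collections import defaultdict

# Matches simple copular sentences such as "The door is open"
# or "The door is not open". Anything more complex is ignored.
COPULA = re.compile(
    r"^(?:the\s+|a\s+|an\s+)?(?P<subject>[\w\s]+?)\s+(?:is|are)\s+"
    r"(?P<neg>not\s+)?(?P<attr>[\w\s]+)$",
    re.IGNORECASE,
)


def find_polarity_conflicts(text: str) -> list[tuple[str, str]]:
    """Return (subject, attribute) pairs asserted with both polarities."""
    seen = defaultdict(set)  # (subject, attribute) -> set of polarities
    for sentence in re.split(r"[.!?]+", text):
        match = COPULA.match(sentence.strip())
        if not match:
            continue
        key = (match["subject"].lower(), match["attr"].lower().strip())
        seen[key].add(match["neg"] is None)  # True = affirmed, False = negated
    return sorted(key for key, polarities in seen.items() if len(polarities) == 2)


sample = "The door is open. The window is closed. The door is not open."
conflicts = find_polarity_conflicts(sample)
print(conflicts)  # [('door', 'open')]
```

This still cannot tell that "flat" and "a sphere" are incompatible, because that requires world knowledge and not just lexical polarity. A production system would more plausibly use a dependency parser or a natural language inference model for this step.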

Step 3: Intentionality Detection

Bender argues that LLMs produce text that appears intentional but lacks actual communicative goals [1]. We'll implement a detector that identifies when the model is generating text that mimics intentional communication without genuine purpose.

from typing import Dict
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from scipy.stats import entropy

class IntentionalityDetector:
    """
    Detects whether LLM output shows signs of genuine communicative intent
    or is merely pattern-matching. Based on Bender's linguistic analysis.
    """

    def __init__(self, model_name: str = "gpt2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()

    def _compute_token_surprisal(self, text: str) -> np.ndarray:
        """
        Compute per-token surprisal (-log2 probability) in bits.
        The composite score below treats low variance as a sign of formulaic text.
        """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(**inputs).logits

        # Logits at position i score the token at position i + 1, so align
        # predictions [0, n-2] with the observed tokens [1, n-1].
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        next_tokens = inputs["input_ids"][0, 1:]
        token_log_probs = log_probs.gather(1, next_tokens.unsqueeze(1)).squeeze(1)

        # Convert natural-log probabilities to bits
        return (-token_log_probs / float(np.log(2))).numpy()

    def _detect_repetitive_patterns(self, text: str, ngram_size: int = 3) -> float:
        """
        Detect excessive n-gram repetition, a sign of pattern-matching
        without understanding. Returns repetition ratio.
        """
        tokens = self.tokenizer.tokenize(text)
        ngrams = set()
        total_ngrams = 0

        for i in range(len(tokens) - ngram_size + 1):
            ngram = tuple(tokens[i:i+ngram_size])
            if ngram in ngrams:
                total_ngrams += 1
            else:
                ngrams.add(ngram)
                total_ngrams += 1

        if total_ngrams == 0:
            return 0.0

        # Ratio of repeated n-grams to total n-grams
        repeated_count = total_ngrams - len(ngrams)
        return repeated_count / total_ngrams

    def analyze_intentionality(self, text: str) -> Dict:
        """
        Returns intentionality metrics based on Bender's framework.
        Lower scores indicate more pattern-matching behavior.
        """
        # Metric 1: Surprisal variance
        surprisals = self._compute_token_surprisal(text)
        surprisal_variance = np.var(surprisals) if len(surprisals) > 0 else 0.0

        # Metric 2: Repetition ratio
        repetition_ratio = self._detect_repetitive_patterns(text)

        # Metric 3: Entropy of token distribution
        token_ids = self.tokenizer(text, return_tensors="pt")["input_ids"][0]
        token_counts = torch.bincount(token_ids).float()
        token_probs = token_counts / token_counts.sum()
        token_entropy = entropy(token_probs.numpy()) if len(token_probs) > 1 else 0.0

        # Composite intentionality score
        # High variance + low repetition + high entropy = more intentional
        normalized_variance = min(surprisal_variance / 10.0, 1.0)
        normalized_repetition = 1.0 - repetition_ratio
        normalized_entropy = min(token_entropy / 5.0, 1.0)

        intentionality_score = (
            0.4 * normalized_variance +
            0.3 * normalized_repetition +
            0.3 * normalized_entropy
        )

        return {
            "intentionality_score": intentionality_score,
            "surprisal_variance": surprisal_variance,
            "repetition_ratio": repetition_ratio,
            "token_entropy": token_entropy,
            "is_pattern_matching": intentionality_score < 0.5
        }

# Example showing pattern-matching detection
detector = IntentionalityDetector()
pattern_text = "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."
result = detector.analyze_intentionality(pattern_text)
print(f"Intentionality score: {result['intentionality_score']:.2f}")
print(f"Pattern matching detected: {result['is_pattern_matching']}")

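The intentionality detector is a statistical proxy for Bender's observation that LLMs lack genuine communicative intent [1]; token statistics cannot measure intent directly. When the repetition ratio is high and surprisal variance is low, the text is formulaic, which the score treats as pattern matching. The weights, normalization constants, and the 0.5 threshold are arbitrary starting points, and this tutorial presents no evidence that the score tracks hallucination rates, so calibrate it against labeled examples from your own system before acting on it.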
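
One detail is easy to get wrong when computing surprisal from a causal language model: the distribution produced at position i scores the token at position i + 1, so predictions and observed tokens must be shifted by one. The toy sketch below shows that alignment with hand-written probability tables in place of a real model. The distributions are invented for illustration.

```python
import math


def token_surprisals(next_token_probs: list[dict[str, float]],
                     tokens: list[str]) -> list[float]:
    """
    next_token_probs[i] is the distribution over the token at position i + 1,
    given tokens[: i + 1]. Returns surprisal in bits for tokens[1:].
    """
    surprisals = []
    for i in range(len(tokens) - 1):
        p = next_token_probs[i].get(tokens[i + 1], 0.0)
        surprisals.append(-math.log2(p) if p > 0 else float("inf"))
    return surprisals


def variance(values: list[float]) -> float:
    """Population variance, matching numpy.var's default."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


tokens = ["the", "cat", "sat"]
# Hypothetical distributions: after "the", and after "the cat".
next_token_probs = [
    {"cat": 0.5, "dog": 0.5},
    {"sat": 0.25, "ran": 0.75},
]
surprisals = token_surprisals(next_token_probs, tokens)
print(surprisals)            # [1.0, 2.0]
print(variance(surprisals))  # 0.25
```

The first token has no surprisal here because nothing precedes it. Reading the probability of token i from the distribution at position i would score each token against a prediction made after the model has already seen it.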

Pitfalls and Production Tips

1. Knowledge Base Maintenance

The grounding checker depends on an up-to-date knowledge base. In production, you'll need to:

  • Connect to a live knowledge graph like Wikidata or a domain-specific database
  • Handle entity disambiguation (e.g., "Washington" could be a state, person, or city)
  • Cache entity lookups to avoid latency spikes
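
# Entity resolution against Wikidata with caching (sketch)
from functools import lru_cache
import requests

class ProductionEntityResolver:
    # Illustrative mapping from spaCy labels to Wikidata classes; extend for your domain
    TYPE_TO_WIKIDATA = {"PERSON": "Q5", "ORG": "Q43229", "GPE": "Q515"}

    def __init__(self, wikidata_endpoint: str = "https://query.wikidata.org/sparql"):
        self.endpoint = wikidata_endpoint

    def _type_to_wikidata(self, entity_type: str) -> str:
        return self.TYPE_TO_WIKIDATA[entity_type]

    @lru_cache(maxsize=10000)
    def resolve_entity(self, entity_text: str, entity_type: str) -> bool:
        """Cache entity resolutions to avoid repeated API calls."""
        if entity_type not in self.TYPE_TO_WIKIDATA:
            return False
        # Escape backslashes and quotes so the label cannot break out of the string literal
        label = entity_text.replace("\\", "\\\\").replace('"', '\\"')
        query = f"""
        SELECT ?item WHERE {{
          ?item rdfs:label "{label}"@en .
          ?item wdt:P31/wdt:P279* wd:{self._type_to_wikidata(entity_type)} .
        }}
        LIMIT 1
        """
        response = requests.get(
            self.endpoint,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "llm-eval-tutorial/0.1 (placeholder contact)"},
            timeout=10,
        )
        response.raise_for_status()
        return len(response.json().get("results", {}).get("bindings", [])) > 0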
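
An LRU cache never expires entries, so a name that was missing from the knowledge base yesterday stays "ungrounded" until the process restarts. One option is a small time-based cache, sketched below. TTLCache is a hypothetical helper introduced for illustration. The clock is injectable so the expiry logic can be tested without sleeping.

```python
import time
from typing import Callable, Hashable


class TTLCache:
    """Tiny time-based cache; entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[Hashable, tuple[float, bool]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], bool]) -> bool:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh, skip the lookup
        value = compute()  # exceptions propagate and are not cached
        self._store[key] = (now, value)
        return value


# Usage with a fake clock and a stand-in for the remote lookup.
fake_now = [0.0]
calls = []

def lookup() -> bool:
    calls.append(1)
    return True

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get_or_compute(("Seattle", "GPE"), lookup)
cache.get_or_compute(("Seattle", "GPE"), lookup)  # served from cache
fake_now[0] = 61.0
cache.get_or_compute(("Seattle", "GPE"), lookup)  # expired, recomputed
print(len(calls))  # 2
```

This sketch has no size bound and no locking, so a long-running multi-threaded service would need both.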

2. False Positives in Coherence Analysis

The semantic coherence analyzer will flag legitimate uses of antonyms (e.g., "The temperature rose then fell"). To reduce false positives:

  • Implement context windows that check for temporal or conditional markers
  • Use dependency parsing to identify negation scope
  • Add a confidence threshold based on the number of supporting contradictions

import re

def _is_legitimate_contradiction(self, claim1: Dict, claim2: Dict, context: str) -> bool:
    """Filter out false positives from temporal sequences."""
    temporal_markers = {"then", "after", "before", "subsequently", "later"}
    # Match whole words so that "then" does not fire inside "strengthen"
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    if context_words & temporal_markers:
        return False  # Likely a temporal sequence, not contradiction
    return True

3. Model-Specific Surprisal Baselines

Different LLMs have different surprisal distributions. You need model-specific baselines:

class ModelAwareIntentionalityDetector(IntentionalityDetector):
    def __init__(self, model_name: str):
        super().__init__(model_name)
        # Placeholder baselines for illustration only; measure real values
        # on a reference corpus for each model before relying on them
        self.baselines = {
            "gpt2": {"mean_surprisal": 5.2, "std_surprisal": 2.1},
            "gpt2-medium": {"mean_surprisal": 4.8, "std_surprisal": 1.9},
            "gpt2-large": {"mean_surprisal": 4.5, "std_surprisal": 1.7}
        }
        self.model_baseline = self.baselines.get(model_name, 
                                                  {"mean_surprisal": 5.0, "std_surprisal": 2.0})

    def _normalize_surprisal(self, surprisal: float) -> float:
        """Z-score normalize against model baseline."""
        return (surprisal - self.model_baseline["mean_surprisal"]) / self.model_baseline["std_surprisal"]
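
Baseline values like these have to be measured, not guessed. Below is a minimal sketch of how they might be computed, assuming you have already collected per-token surprisals for a set of reference texts. compute_baseline and z_score are hypothetical helpers, and the numbers are made up.

```python
import statistics


def compute_baseline(surprisal_runs: list[list[float]]) -> dict[str, float]:
    """Pool per-token surprisals from reference texts into a mean and std."""
    pooled = [s for run in surprisal_runs for s in run]
    if len(pooled) < 2:
        raise ValueError("need at least two surprisal values")
    return {
        "mean_surprisal": statistics.fmean(pooled),
        "std_surprisal": statistics.pstdev(pooled),
    }


def z_score(surprisal: float, baseline: dict[str, float]) -> float:
    """Z-score against the baseline; returns 0.0 if the baseline has no spread."""
    std = baseline["std_surprisal"]
    return 0.0 if std == 0 else (surprisal - baseline["mean_surprisal"]) / std


# Made-up surprisals (in bits) from two reference texts.
reference_runs = [[2.0, 4.0], [6.0, 8.0]]
baseline = compute_baseline(reference_runs)
print(baseline["mean_surprisal"])          # 5.0
print(round(z_score(8.0, baseline), 3))    # 1.342
```

The reference corpus should resemble production traffic, since surprisal statistics shift noticeably between domains.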

4. Memory and Latency Considerations

The full pipeline can be memory-intensive. For production deployment:

  • Batch process evaluations to amortize model loading costs
  • Use model quantization (e.g., torch.quantization) for the intentionality detector
  • Implement a fallback mode that skips expensive checks for short outputs
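
class ProductionEvaluationPipeline:
    def __init__(self, use_quantized: bool = True):
        self.grounding_checker = EntityGroundingChecker()
        self.coherence_analyzer = SemanticCoherenceAnalyzer()
        if use_quantized:
            # Load quantized model for faster inference
            self.intentionality_detector = self._load_quantized_detector()
        else:
            self.intentionality_detector = IntentionalityDetector()

    def _load_quantized_detector(self) -> IntentionalityDetector:
        """Apply dynamic int8 quantization to the detector's nn.Linear layers (CPU only)."""
        detector = IntentionalityDetector()
        # Note: Hugging Face's GPT-2 uses Conv1D for most projections, so the
        # speedup for the default model may be small; measure before relying on it.
        detector.model = torch.quantization.quantize_dynamic(
            detector.model, {torch.nn.Linear}, dtype=torch.qint8
        )
        return detector

    def evaluate(self, text: str, min_length: int = 50) -> Dict:
        """Run evaluation with length-based optimizations."""
        results = {"grounding": self.grounding_checker.check_grounding(text)}

        # Skip coherence analysis for very short texts (10 words or fewer)
        if len(text.split()) > 10:
            results["coherence"] = self.coherence_analyzer.check_coherence(text)
        else:
            results["coherence"] = {"score": 1.0, "contradictions": [], "total_claims": 0}

        # Skip the model forward pass for texts of min_length characters or fewer;
        # too few tokens give unstable statistics, so return a neutral result
        if len(text) > min_length:
            results["intentionality"] = self.intentionality_detector.analyze_intentionality(text)
        else:
            results["intentionality"] = {"intentionality_score": 0.5, "is_pattern_matching": False}

        return results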

What's Next

Bender's framework provides a rigorous foundation for evaluating LLM outputs, but it's not a complete solution. Here are the next steps for production systems:

  1. Integrate with monitoring systems: Feed evaluation scores into dashboards like Grafana to track model behavior over time
  2. Build domain-specific knowledge bases: The grounding checker is only as good as its reference data. For medical or legal applications, you'll need curated knowledge bases
  3. Implement automated rollback triggers: When intentionality scores drop below 0.3, automatically fall back to a simpler, more predictable model
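
A single low score is noisy, so the rollback trigger in step 3 is usually better driven by a rolling average than by individual outputs. The sketch below shows one way to do that. RollbackTrigger is a hypothetical class, and the window size and threshold are illustrative, not recommended values.

```python
from collections import deque


class RollbackTrigger:
    """Signals a fallback when the rolling mean score drops below a threshold."""

    def __init__(self, threshold: float = 0.3, window: int = 5):
        self.threshold = threshold
        self.scores: deque[float] = deque(maxlen=window)

    def observe(self, intentionality_score: float) -> bool:
        """Record a score; return True when the fallback model should be used."""
        self.scores.append(intentionality_score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        # Only fire once the window is full, to avoid reacting to a cold start.
        return window_full and mean < self.threshold


trigger = RollbackTrigger(threshold=0.3, window=3)
decisions = [trigger.observe(s) for s in [0.6, 0.2, 0.2, 0.2, 0.7]]
print(decisions)  # [False, False, False, True, False]
```

The trigger fires on the fourth score, when all three values in the window are low, and clears again once a healthy score raises the mean. A real deployment would likely add hysteresis so the system does not flip between models on every request.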

The evaluation pipeline we built draws on Bender's core insights about referential grounding, semantic coherence, and communicative intent [1]. These checks target failure modes that surface metrics miss. Used alongside human review and calibrated on your own data, they can help catch hallucinations earlier and let a system degrade gracefully when the model produces text it has no basis for.

The key takeaway from Bender's work is that LLM evaluation must go beyond fluency metrics. A model that generates perfect English sentences can still be fundamentally unreliable if it lacks grounding in real-world knowledge and genuine communicative intent. The tools we've built here give you concrete, measurable ways to detect these failures before they reach end users.


References

1. Wikipedia - Transformers. Wikipedia.
2. Wikipedia - Rag. Wikipedia.
3. Wikipedia - OpenAI. Wikipedia.
4. GitHub - huggingface/transformers. GitHub.
5. GitHub - Shubhamsaboo/awesome-llm-apps. GitHub.
6. GitHub - openai/openai-python. GitHub.
7. GitHub - Significant-Gravitas/AutoGPT. GitHub.
8. OpenAI Pricing. Pricing.