Computational Linguistics · 14 min

Probing Classifiers for Linguistic Structure

Train a small classifier to predict a linguistic label (POS tag, dependency relation, syntactic depth) from a frozen language model's hidden states. If the probe succeeds under appropriate controls, the representation contains information predictive of the linguistic property.

Why This Matters

A pretrained language model — BERT, GPT, LLaMA — has hundreds of millions to hundreds of billions of parameters. The representations it produces (per-token vectors at each layer) are high-dimensional and opaque. The empirical question: what linguistic properties has the model implicitly learned?

The dominant methodology since 2019: probing classifiers. Train a small (often linear) classifier to predict a linguistic label — POS tag, dependency relation, syntactic tree depth, named-entity type — from a frozen LM's hidden states. If the probe succeeds under appropriate controls, the representation contains information predictive of the property, extractable by a classifier no more powerful than the probe.

The probing-classifier methodology produced the foundational empirical findings of the LLM-and-linguistics era:

  • BERT rediscovers the classical NLP pipeline (Tenney et al. 2019): different layers encode different levels of linguistic structure (POS at lower layers, syntax in the middle, coreference at the top). The classical pipeline ordering (POS → parse → coreference → semantics) is mirrored in BERT's layer hierarchy.
  • The structural probe recovers syntactic trees from BERT embeddings (Hewitt-Manning 2019): a linear projection of BERT activations into a low-dimensional "syntax space" recovers most of the dependency-tree structure.
  • LSTMs learn long-distance subject-verb agreement (Linzen-Dupoux-Goldberg 2016, the earliest of these results): even pre-transformer models capture non-trivial syntactic dependencies.

Probing is one major method for asking what information an LM representation contains. It is a bridge from empirical benchmark scores to interpretability-style claims, but it is not a causal mechanism by itself.

The Probing Methodology

The basic recipe (a minimal code sketch follows the list):

  1. Take a pretrained LM (BERT, GPT-2, LLaMA, etc.).
  2. Freeze the model's weights.
  3. For each layer $\ell$ and each token position $t$, extract the hidden state $h^{(\ell)}_t$.
  4. Train a small classifier $f$ (often linear, sometimes a small MLP) to predict a linguistic label $y_t$ from $h^{(\ell)}_t$.
  5. Report the probe's accuracy as the model's "knowledge" of the linguistic property at layer $\ell$.
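
A minimal sketch of steps 1–5, assuming the HuggingFace `transformers` and `scikit-learn` libraries; the POS-tagged corpus (`train_sents`, `train_tags`) and the dev split (`X_dev`, `y_dev`) are hypothetical placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()  # step 2: the LM stays frozen; its weights are never updated

def token_features(sentence, layer=7):
    """Step 3: one hidden-state vector per word at the given layer."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, 768)
    # Keep only the first subword of each word (a common simplification).
    word_ids = enc.word_ids()
    keep = [i for i, w in enumerate(word_ids)
            if w is not None and (i == 0 or word_ids[i - 1] != w)]
    return hidden[keep].numpy()

# Steps 4-5: fit a linear probe on (hidden state -> POS tag) pairs and
# report held-out accuracy. train_sents/train_tags are placeholders.
X = [v for s in train_sents for v in token_features(s)]
y = [t for tags in train_tags for t in tags]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("layer-7 probe accuracy:", probe.score(X_dev, y_dev))
```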

Variations across studies:

  • Linear probe: a single linear layer. Most common; tests whether the property is linearly encoded.
  • MLP probe: a small multi-layer perceptron. Tests whether the property is encoded at all, even if only non-linearly.
  • Edge probe (Tenney et al. 2019): probes pairs of token representations for relations (dependency arc, coreference link). The standard method for relation-level probing; a simplified sketch follows this list.
  • Structural probe (Hewitt-Manning 2019): probes for tree distances rather than per-token labels. Reconstructs the full syntactic tree from a low-dimensional projection.
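
A simplified sketch of the edge-probe idea. Tenney et al.'s full setup pools over spans with learned attention; here, single-token spans are assumed, the MLP size is illustrative, and `n_relations` is a placeholder:

```python
import torch
import torch.nn as nn

n_relations = 40  # placeholder: size of the relation label set

# Classify a relation from a PAIR of frozen token representations.
edge_probe = nn.Sequential(
    nn.Linear(2 * 768, 256),      # concatenated (h_i, h_j)
    nn.ReLU(),
    nn.Linear(256, n_relations),  # e.g. dependency-relation labels
)

def edge_logits(h, i, j):
    """Score candidate relations between tokens i and j of one sentence.
    h: (seq_len, 768) frozen hidden states."""
    pair = torch.cat([h[i], h[j]])  # (2 * 768,)
    return edge_probe(pair)
```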

Layer-Wise Probing Results

Tenney et al. 2019 showed the canonical result: different layers of BERT encode different linguistic information at different strengths. Their plot of probing F1 vs layer for multiple tasks:

  • POS tagging: peaks at layers 3-4 (early); accuracy stays high at later layers.
  • Constituency parsing: peaks at layers 6-7 (middle).
  • Dependency parsing: peaks at layers 6-8.
  • Semantic role labeling: peaks at layers 9-11 (later).
  • Coreference resolution: peaks at layers 11-12 (top).

The ordering matches the classical NLP pipeline: parse before semantics, semantics before coreference. BERT, trained solely on masked language modeling, rediscovered this structure without explicit supervision.

This is one of the cleanest empirical results in the LLM-meets-linguistics literature. It suggests transformers learn a hierarchical organization of linguistic information that matches human linguistic theorizing.
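
One way such a layer sweep might look in code, reusing the hypothetical `token_features` helper and linear probe from the recipe sketch above (`dev_sents`, `y`, and `y_dev` are again placeholders):

```python
# Retrain the probe at every layer; 13 = embedding layer + 12 BERT-base layers.
layer_acc = {}
for layer in range(13):
    X_tr = [v for s in train_sents for v in token_features(s, layer=layer)]
    X_dv = [v for s in dev_sents for v in token_features(s, layer=layer)]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y)
    layer_acc[layer] = probe.score(X_dv, y_dev)

print(max(layer_acc, key=layer_acc.get))  # layer where the property peaks
```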

The Structural Probe

Hewitt-Manning 2019 introduced a probe for syntactic tree distance. Train a low-rank linear projection $B$ such that the squared distance between any two projected token representations matches the syntactic-tree distance between those tokens.

Specifically, for tokens $i$ and $j$ in a sentence:

$$d_{\text{tree}}(i, j) \approx \| B (h_i - h_j) \|^2$$

where $h_i, h_j$ are BERT's hidden states. The probe is trained on a treebank (the Penn Treebank) by minimizing the gap between predicted and gold tree distances (Hewitt-Manning use an L1 loss over token pairs, normalized by sentence length).
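
A minimal PyTorch sketch of the training step, assuming hidden states `h` and gold tree distances `d_tree` come from a treebank preprocessing pipeline (not shown):

```python
import torch

hidden_dim, probe_rank = 768, 32
B = torch.nn.Parameter(0.01 * torch.randn(probe_rank, hidden_dim))
opt = torch.optim.Adam([B], lr=1e-3)

def predicted_distances(h):
    """Pairwise squared L2 distances between projected token vectors.
    h: (seq_len, hidden_dim) -> (seq_len, seq_len)."""
    proj = h @ B.T                               # (seq_len, probe_rank)
    diff = proj[:, None, :] - proj[None, :, :]   # (seq_len, seq_len, rank)
    return (diff ** 2).sum(-1)

def train_step(h, d_tree):
    """One step of the L1 objective, normalized by sentence length squared."""
    loss = (predicted_distances(h) - d_tree).abs().sum() / h.shape[0] ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```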

Results: a 32-dimensional projection of BERT's 768-dimensional hidden state recovers >90% of the dependency-tree structure on held-out sentences. The "syntax space" is a 32-dimensional subspace of BERT's representation space; this is the central empirical evidence that syntactic structure is geometrically encoded.

Control Tasks (Hewitt-Liang 2019)

A subtle issue: a probe might succeed because the LM has learned the property or because the probe itself is powerful enough to memorize the labels.

Hewitt-Liang 2019 introduced control tasks: train the probe to predict random labels assigned to each word type. If the probe also succeeds at the control task, its expressive power is the issue, not the model's knowledge.

The recommendation: report the gap between the probe's performance on the real task and on the control task. Selectivity = (real-task accuracy) − (control-task accuracy). High selectivity means the representation genuinely encodes the property; low selectivity means the probe's own capacity is doing the work.
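
A sketch of the control-task comparison; `train_and_eval_probe` is a hypothetical helper wrapping the probe training above, and `features`, `tokens`, and `gold_tags` are placeholders:

```python
import random

_assignment = {}  # fixed random label per word TYPE (Hewitt-Liang 2019)

def control_labels(tokens, n_labels):
    return [_assignment.setdefault(tok, random.randrange(n_labels))
            for tok in tokens]

real_acc = train_and_eval_probe(features, gold_tags)  # placeholder helper
ctrl_acc = train_and_eval_probe(
    features, control_labels(tokens, n_labels=len(set(gold_tags))))

selectivity = real_acc - ctrl_acc  # high: the model, not the probe, has it
```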

This methodological refinement was widely adopted; modern probing studies report selectivity alongside accuracy.

Information-Theoretic Probing

Voita-Titov 2020 reframed probing in information-theoretic terms. The relevant quantity is the minimum description length (MDL) of the labels given the representations: accuracy that comes cheaply (small codelength) reflects genuine model knowledge; accuracy that can only be bought with a complex probe (large codelength) reflects probe memorization.

The MDL-probing methodology gives a numerical score for each probing experiment, comparable across tasks and models. It is one influential way to control for probe complexity.
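
A sketch of the online (prequential) code from Voita-Titov 2020, assuming a `fit_probe` helper that trains a classifier exposing a scikit-learn-style `predict_proba`; the block fractions are illustrative:

```python
import numpy as np

def online_mdl(X, y, n_classes,
               fractions=(0.001, 0.002, 0.004, 0.008, 0.016,
                          0.032, 0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Codelength in bits: the first block is coded uniformly, each later
    block is coded by a probe trained on everything before it.
    Assumes y is an int array with labels 0..n_classes-1."""
    cuts = [max(1, int(f * len(y))) for f in fractions]
    mdl = cuts[0] * np.log2(n_classes)        # uniform code for block 1
    for lo, hi in zip(cuts, cuts[1:]):
        probe = fit_probe(X[:lo], y[:lo])     # hypothetical training helper
        p = probe.predict_proba(X[lo:hi])     # (hi - lo, n_classes)
        mdl += -np.log2(p[np.arange(hi - lo), y[lo:hi]]).sum()
    return mdl  # smaller codelength = easier-to-extract information
```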

What a Probe Does and Does Not Show

| A successful controlled probe can show... | It does not show by itself... |
| --- | --- |
| The representation contains information predictive of the label | The model causally uses that information during generation |
| The information is linearly or non-linearly extractable, depending on probe class | The feature is represented in a human-like way |
| The information appears more strongly in some layers than others | The result is robust across prompts, models, domains, or languages |
| A comparison between real labels and control labels | The probe has discovered a real circuit |

ML and Linguistic-Theory Connections

What the probing literature has found

Across hundreds of probing studies on BERT, RoBERTa, GPT-2, T5, LLaMA, and others, recurring findings:

  • Lower layers: surface-level features (POS tags, basic morphology, character-level information).
  • Middle layers: syntactic structure (constituency, dependency, agreement).
  • Upper layers: semantic and pragmatic information (coreference, sentiment, relation extraction).
  • Final layers: increasingly task-specific, reflecting a generalization-vs-specialization trade-off.

The pattern is robust across models and probing methods.

Limitations and critiques

Probing has methodological limits:

  • Probe expressivity vs model knowledge: even with control tasks, probes can extract information the model doesn't actively use during prediction.
  • Linear vs non-linear: a property that fails a linear probe may still be encoded non-linearly and used by the model.
  • Frozen vs finetuned: probes operate on frozen representations. Finetuning the model changes the representations; what is extractable before finetuning may not be extractable afterward.
  • Causal vs correlational: probing is correlational; it shows the property is encoded, not that it is used.

The mechanistic-interpretability program (activation patching, sparse autoencoders) addresses some of these limits by adding causal interventions. See mechanistic interpretability on TheoremPath.

Probing beyond linguistics

The same methodology is used outside linguistics for properties such as refusal behavior, sentiment, calibration, or truthfulness signals. Those uses belong more directly to alignment and mechanistic interpretability than to linguistics, but the statistical caution is the same: a classifier can reveal extractable information without proving causal use.

Common Mistakes

Watch Out

Treating probe accuracy as model knowledge

A probe's accuracy depends on (1) what the model encodes and (2) the probe's expressive capacity. High probe accuracy can mean either. Control tasks and MDL probing distinguish these; single-number probe accuracy without controls is uninformative.

Watch Out

Generalizing one probing result across models

Different transformer models encode different things. BERT's layer-wise findings don't always transfer to GPT-style models or to LLaMA. Each model architecture and training run can produce different probing-result patterns.

Watch Out

Confusing probing with mechanistic interpretability

Probing is correlational and global: it shows the representation contains the information. Mechanistic interpretability is causal and local: it identifies which part of the network uses the information. The two are complementary; they answer different questions.

Watch Out

Forgetting that probing requires labeled data

Probing requires gold labels for training the probe. Linguistic-structure probes use treebanks; semantic-role probes use SRL annotations. The probe is only as good as the labels. For low-resource phenomena or languages, probing is limited.

Cross-Network Links

  • LinguisticsPath internal: prerequisite vector-semantics-and-word2vec-revisited; next natural topics are contextual embeddings and syntactic circuits.
  • TheoremPath direction: BERT, contextual embeddings, and mechanistic interpretability provide the model-side background.
  • ComputationPath: transformers-as-formal-computational-models gives the theoretical bracket on what probing can find.
  • DSAPath: linear and small-MLP classifier training is standard ML; the probing methodology uses these as building blocks.

References

Canonical:

  • Belinkov, Yonatan, and James Glass. "Analysis Methods in Neural Language Processing: A Survey." TACL 7 (2019) 49-72.
  • Linzen, Tal, Emmanuel Dupoux, and Yoav Goldberg. "Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies." TACL 4 (2016) 521-535.
  • Hewitt, John, and Christopher D. Manning. "A Structural Probe for Finding Syntax in Word Representations." NAACL (2019).
  • Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT Rediscovers the Classical NLP Pipeline." ACL (2019).

Methodological Controls:

  • Hewitt, John, and Percy Liang. "Designing and Interpreting Probes with Control Tasks." EMNLP (2019).
  • Voita, Elena, and Ivan Titov. "Information-Theoretic Probing with Minimum Description Length." EMNLP (2020).
  • Manning, Christopher D., et al. "Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision." PNAS 117 (2020) 30046-30054.