BERTology

Notes on the paper "A Primer in BERTology: What We Know About How BERT Works" (Rogers, Kovaleva & Rumshisky, 2020).

BERT Embeddings

  • As an NLU encoder, BERT produces context-dependent embeddings; research on them has found that:
    • BERT embeddings form distinct clusters corresponding to word senses
    • Embeddings of the same word vary with its position in the sentence, an effect that seems related to the Next Sentence Prediction (NSP) objective
    • Studies of word representations across layers show that higher layers are more context-specific; at the same time the embeddings become more anisotropic (they occupy a narrow cone in the vector space), so cosine distances between random words are smaller than would be expected in an isotropic space (see the sketch below)
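
A minimal sketch of how the anisotropy claim can be checked, assuming PyTorch and Hugging Face transformers with the bert-base-uncased checkpoint (the example sentence is arbitrary): compute the mean pairwise cosine similarity between token representations at every layer.

```python
# Sketch: mean pairwise cosine similarity between token representations at each layer.
# Assumes PyTorch + Hugging Face transformers and the bert-base-uncased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # embeddings + 12 layers, each [1, seq, 768]

for layer, states in enumerate(hidden_states):
    tokens = torch.nn.functional.normalize(states[0], dim=-1)   # [seq, 768], unit norm
    sims = tokens @ tokens.T                                    # pairwise cosine similarities
    off_diag = sims[~torch.eye(sims.size(0), dtype=torch.bool)]
    print(f"layer {layer:2d}: mean cosine similarity {off_diag.mean():.3f}")
```

If the mean similarity climbs in the upper layers and stays well above zero, that is the narrow-cone (anisotropy) effect described above.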

Syntactic Knowledge

  • BERT's representations are hierarchical rather than linear: they capture something closer to a syntactic tree than to plain word order
  • BERT encodes POS, chunk, and other information, but doesn't fully capture syntactic information (some long-distance dependencies are ignored)
  • Syntactic structure is not directly encoded in the self-attention weights; it has to be recovered through a learned transformation of the representations
  • BERT takes subject-predicate agreement into account in cloze tasks (probed in the sketch after this list)
  • BERT cannot understand negation and is not sensitive to malformed input
  • Its predictions remain largely unchanged under shuffled word order, truncated sentences, or removal of the subject and predicate
  • In essence, BERT encodes syntactic information without fully utilizing it
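
The agreement point above can be illustrated with a cloze probe. A minimal sketch, assuming Hugging Face transformers and bert-base-uncased; the sentence and verb pair are illustrative, with a distractor noun between the plural subject and the verb.

```python
# Sketch: subject-predicate agreement as a cloze test with BERT's MLM head.
# Assumes Hugging Face transformers; the sentence and verb pair are illustrative,
# with a distractor noun ("cabinet") between the plural subject and the verb.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = f"The keys to the cabinet {tokenizer.mask_token} on the table."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

for verb in ["are", "is"]:
    print(verb, float(logits[tokenizer.convert_tokens_to_ids(verb)]))
# A higher score for "are" than for "is" means BERT agrees with the plural
# subject "keys" despite the intervening singular noun.
```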

Semantic Knowledge

  • Probing tasks built on the masked language model (MLM) suggest that BERT encodes some semantic role information
  • BERT encodes entity types, relations, semantic roles, and proto-roles
  • Because of WordPiece tokenization, which can split similar numbers into very different sub-word pieces, BERT handles number representation and numerical reasoning poorly (see the sketch below)
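
A quick way to see the tokenization issue, assuming Hugging Face transformers; the numbers chosen are arbitrary.

```python
# Sketch: WordPiece can split numerically adjacent values into very different pieces,
# which makes consistent number representations hard to learn.
# Assumes Hugging Face transformers; the numbers are arbitrary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for number in ["500", "501", "1000", "1001", "12345"]:
    print(number, "->", tokenizer.tokenize(number))
# Some values stay whole while neighbouring ones are split into sub-word pieces,
# so similar numbers can end up with very different token sequences.
```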

World Knowledge

  • For certain relation types, BERT outperforms knowledge-base methods; factual knowledge can be extracted from it given good template sentences (probed in the sketch after this list)
  • However, BERT cannot use this knowledge for reasoning
  • Research has found that much of this knowledge is guessed from stereotypical associations rather than stored facts: e.g., BERT will predict that a person with an Italian-sounding name is Italian even when that is factually incorrect
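
A minimal sketch of LAMA-style knowledge extraction with cloze templates, assuming Hugging Face transformers; the two templates are illustrative examples of the kind of statements used for probing.

```python
# Sketch: LAMA-style knowledge extraction with cloze templates.
# Assumes Hugging Face transformers; the two templates are illustrative.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

templates = [
    f"Dante was born in {tokenizer.mask_token}.",
    f"The capital of France is {tokenizer.mask_token}.",
]
for text in templates:
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    top_ids = torch.topk(logits, k=5).indices.tolist()
    print(text, "->", tokenizer.convert_ids_to_tokens(top_ids))
```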

Self-Attention Heads

  • Research has categorized attention heads into several types:
    • Attending to self, adjacent words, sentence end
    • Attending to adjacent words, to [CLS], to [SEP], or spreading attention across the entire sequence
    • Or the five attention-map patterns identified in related work: vertical, diagonal, vertical + diagonal, block, and heterogeneous (figure omitted)
  • What an attention weight means: how strongly each other token is weighted when computing the current token's next-layer representation
  • Self-attention does not directly encode linguistic information: most heads show heterogeneous or vertical patterns, which points to overparametrization (a pattern-statistics sketch follows this list)
  • Few heads encode words' syntactic roles
  • A single head cannot capture complete syntactic tree information
  • Even heads that do capture semantic relations are not necessarily needed to improve performance on the corresponding tasks
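
A minimal sketch of measuring one such pattern, assuming Hugging Face transformers and bert-base-uncased (the input sentence is arbitrary): the share of each head's attention mass that lands on [CLS]/[SEP], a rough proxy for the "vertical" pattern.

```python
# Sketch: a crude pattern statistic per head -- the share of attention mass that lands
# on [CLS]/[SEP] (a proxy for the "vertical" pattern). Assumes Hugging Face transformers;
# the input sentence is arbitrary.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat because it was tired.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions          # 12 layers x [1, heads, seq, seq]

special = (inputs.input_ids[0] == tokenizer.cls_token_id) | \
          (inputs.input_ids[0] == tokenizer.sep_token_id)

for layer, attn in enumerate(attentions):
    heads = attn[0]                                  # [num_heads, seq, seq]
    to_special = heads[:, :, special].sum(-1).mean(-1)   # mean mass on [CLS]/[SEP] per head
    print(f"layer {layer:2d}:", [round(v, 2) for v in to_special.tolist()])
```

Heads with values close to 1.0 are essentially "vertical"; diagonal or block patterns could be measured analogously from the same tensors.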

Layers

  • Lower layers carry the most information about linear word order; in higher layers word-order information weakens while knowledge-related information grows stronger
  • BERT's middle layers contain the strongest syntactic information, potentially capable of reconstructing syntactic trees
  • Middle layers also show the best transfer performance to downstream tasks (figure omitted; a layer-wise probing sketch follows this list)
  • However, this conclusion is conflicting: some find lower layers better for chunking, higher layers for parsing, while others find middle layers best for tagging and chunking
  • During fine-tuning, the last layers change the most, while changes to the lower layers have little impact on performance
  • Semantic information exists across all layers
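
One common way to produce such per-layer findings is to fit a simple probe on frozen representations from each layer. A minimal sketch assuming transformers and scikit-learn; the four sentences and binary labels are purely illustrative stand-ins for a real probing dataset (e.g., POS or chunk labels).

```python
# Sketch: layer-wise probing -- fit a simple classifier on frozen hidden states from each
# layer and compare scores. Assumes transformers + scikit-learn; the toy sentences and
# binary labels are purely illustrative stand-ins for a real probing dataset.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

texts = ["the movie was great", "a terrible waste of time",
         "i loved every minute", "the plot made no sense"]
labels = [1, 0, 1, 0]

def layer_features(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**inputs).hidden_states            # 13 x [1, seq, 768]
    return [h[0].mean(0).numpy() for h in hs]         # mean-pool tokens at every layer

feats = [layer_features(t) for t in texts]            # [n_texts][n_layers][768]
for layer in range(len(feats[0])):
    X = [f[layer] for f in feats]
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer:2d} train accuracy: {clf.score(X, labels):.2f}")
```

With a real probing task and held-out data, the per-layer scores trace out curves like the findings summarized above (e.g., syntax peaking in the middle layers).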

Pre-training

  • The original pre-training objectives were MLM and NSP; later research has proposed improved training objectives:
    • Removing NSP has minimal impact, especially in multilingual versions
    • NSP can be extended to predict adjacent sentences or use inverted sentences as negative samples
    • Dynamic masking (re-sampling the masked positions each time an example is seen, instead of fixing them once during preprocessing) can improve performance; see the sketch after this list
    • Beyond-sentence MLM: replacing sentences with arbitrary strings
    • Permutation language modeling (XLNet): the prediction order of the tokens is permuted and tokens are predicted autoregressively in that order
    • Span boundary objective: using span boundary words for prediction
    • Phrase masking and named entity masking
    • Continual learning
    • Conditional MLM: replacing segmentation embedding with label embedding
    • Replacing the [MASK] token with [UNK]
  • Another improvement path involves datasets, attempting to integrate structured data or common-sense information through entity embeddings or semantic role information (e.g., E-BERT, ERNIE, SemBERT)
  • Regarding pre-training necessity: it makes models more robust, but effectiveness varies by task
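
A minimal sketch of dynamic masking under the standard 80% [MASK] / 10% random / 10% unchanged rule, assuming Hugging Face transformers (this mirrors the usual MLM data-collator logic rather than any particular paper's code).

```python
# Sketch: dynamic masking -- a fresh random mask is sampled every time an example is used,
# instead of being fixed once at preprocessing time as in the original BERT.
# Assumes Hugging Face transformers; follows the 80% [MASK] / 10% random / 10% keep rule.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def dynamic_mask(input_ids, mlm_probability=0.15):
    labels = input_ids.clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool)
    probs = torch.full(input_ids.shape, mlm_probability)
    probs[special] = 0.0
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                            # loss is computed on masked positions only

    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = tokenizer.mask_token_id
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[random_tok] = torch.randint(len(tokenizer), input_ids.shape)[random_tok]
    return input_ids, labels                          # remaining 10% keep the original token

ids = tokenizer("dynamic masking resamples the mask on every pass", return_tensors="pt").input_ids[0]
print(tokenizer.convert_ids_to_tokens(dynamic_mask(ids.clone())[0].tolist()))
```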

Model Architecture

  • Layer count is more important than head count
  • Large batches can accelerate model convergence (batch size of 32k can reduce training time without performance degradation)
  • "A robustly optimized BERT pretraining approach" published optimal parameter settings
  • Because self-attention patterns in higher layers resemble those in lower layers, training a shallow model first and copying its parameters into the deeper layers ("stacking") can make training about 25% more efficient (sketched below)
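
A minimal sketch of the layer-copying idea, assuming Hugging Face transformers' BertConfig/BertModel; in practice the shallow model would be pre-trained before its layers are duplicated into the deeper one.

```python
# Sketch: grow a trained shallow encoder into a deeper one by copying layer parameters.
# Assumes Hugging Face transformers; in practice the 6-layer model would be pre-trained
# first, then its layers duplicated into the 12-layer model before further training.
from transformers import BertConfig, BertModel

shallow = BertModel(BertConfig(num_hidden_layers=6))     # stand-in for a trained shallow model
deep = BertModel(BertConfig(num_hidden_layers=12))

deep.embeddings.load_state_dict(shallow.embeddings.state_dict())
for i, layer in enumerate(deep.encoder.layer):
    layer.load_state_dict(shallow.encoder.layer[i % 6].state_dict())   # layers 0-5 copied twice
```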

Fine-tuning

  • Some view fine-tuning as teaching BERT what information to ignore
  • Fine-tuning suggestions:
    • Use a weighted combination of the outputs of several layers rather than only the last layer (see the scalar-mix sketch after this list)
    • Two-stage fine-tuning (an intermediate supervised task between pre-training and the target task)
    • Adversarial token perturbations
  • Adapter modules can accelerate fine-tuning
  • Initialization is important, but no papers have systematically investigated this
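
A minimal sketch of the multi-layer weighting suggestion as a learned "scalar mix", assuming PyTorch and transformers; the ScalarMixClassifier class and its head are my own illustration, not a library API.

```python
# Sketch: a learned weighted sum over all layer outputs (a "scalar mix") feeding a simple
# classification head. Assumes PyTorch + transformers; ScalarMixClassifier and its head
# are my own illustration, not a library API.
import torch
import torch.nn as nn
from transformers import AutoModel

class ScalarMixClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        num_layers = self.bert.config.num_hidden_layers + 1      # + embedding layer
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).hidden_states
        weights = torch.softmax(self.layer_logits, dim=0)         # one weight per layer
        stacked = torch.stack(hidden_states, dim=0)               # [layers, batch, seq, hidden]
        mixed = (weights[:, None, None, None] * stacked).sum(0)   # [batch, seq, hidden]
        return self.head(mixed[:, 0])                             # classify on the [CLS] position
```

After fine-tuning, the softmax of layer_logits also shows which layers the task actually relies on.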

Overparametrization

  • BERT does not make effective use of its huge parameter count: most attention heads can be pruned (see the pruning sketch after this list)
  • Heads in one layer are mostly similar, potentially reducible to a single head
  • Some layers and heads can degrade model performance
  • On subject-predicate agreement and subject detection, larger BERT models sometimes perform worse than smaller ones
  • Head redundancy may be partly due to all heads in a layer feeding the same MLP and to the use of attention dropout
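
Head pruning is supported directly in Hugging Face transformers via prune_heads; a minimal sketch, where the layer and head indices are arbitrary examples rather than a recommended recipe.

```python
# Sketch: pruning attention heads with the prune_heads utility in Hugging Face transformers.
# The layer/head indices below are arbitrary examples, not a recommended recipe.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# dict maps layer index -> list of head indices to remove in that layer
model.prune_heads({0: [0, 1, 2], 5: [4], 11: [7, 8]})
print(model.config.pruned_heads)
```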

Compression

  • Two primary methods: quantization and knowledge distillation
  • Other approaches include progressive module replacing, embedding matrix decomposition, and converting multiple layers into a single recurrent layer
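
Of the two main routes above, post-training dynamic quantization is the quickest to try. A minimal sketch with PyTorch's dynamic quantization applied to BERT's linear layers (bert-base-uncased is just an example checkpoint).

```python
# Sketch: post-training dynamic quantization of BERT's linear layers to int8 with PyTorch,
# the quickest of the compression routes above to try; bert-base-uncased is just an example.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

print(quantized.encoder.layer[0].attention.self.query)
# the fp32 nn.Linear modules are replaced by dynamically quantized int8 linear modules
```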

Multilingual BERT

  • Multilingual BERT performs excellently in zero-shot transfer for many tasks but poorly in language generation
  • Improvement methods:
    • Freezing the lower layers during fine-tuning (sketched after this list)
    • Translation language modeling
    • Improving word alignment in fine-tuning
    • Combining 5 pre-training tasks (monolingual and cross-lingual MLM, translation language modeling, cross-lingual word recovery, and paraphrase classification)
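
A minimal sketch of the layer-freezing idea for multilingual BERT, assuming Hugging Face transformers; freezing the bottom 6 of 12 layers is an arbitrary cutoff for illustration.

```python
# Sketch: freezing the lower layers of multilingual BERT during fine-tuning, one of the
# improvements listed above. Assumes Hugging Face transformers; freezing the bottom 6 of
# 12 layers is an arbitrary cutoff for illustration.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased")

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False
# ...then fine-tune on the source language (e.g. English) and evaluate zero-shot
# on the target languages.
```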