BERT Transformer

Transformer-based models have become the state-of-the-art solution for most NLP tasks. In this topic, we will explore one of the most commonly used transformer architectures: BERT.

BERT architecture

BERT stands for Bidirectional Encoder Representations from Transformers. Its bidirectional nature allows it to capture contextual relationships between words comprehensively. The core building block of BERT's encoder is the self-attention mechanism, extended into multi-head attention.

  • Self-attention mechanism
    This mechanism enables the model to weigh the significance of different words in a sentence while considering their contextual relationships. Think of it as a way for the model to focus more on certain words based on their relevance to the overall sentence meaning. For example, when you read the sentence, "The cat sat on the mat," your brain intuitively links "mat" with "cat" and recognizes their relationship. Self-attention mimics this process mathematically.

    Here's how self-attention works (see the code sketch after this list):

    1. Queries, Keys, and Values: For each word in a sentence, the self-attention mechanism computes three linear transformations of the original word embeddings: queries, keys, and values.
    2. Calculating Attention Scores: The model calculates attention scores between each query and all keys. These scores indicate how much focus each word (query) should place on other words (keys) in the sentence. The scores are determined by the similarity between the query and key representations.
    3. Weighted Sum of Values: Using the attention scores, the model takes a weighted sum of the values. This step emphasizes the words that are most relevant to the current word, given the relationships between words in the sentence.

  • Multi-Head Attention
    Multi-head attention is a technique used to capture various types of relationships simultaneously. Think of it as allowing the model to view the sentence from multiple perspectives or lenses. In multi-head attention, the embeddings are projected into n lower-dimensional parts, each processed by its own attention head with separately learned projections, and the heads' outputs are concatenated. For instance, one head might focus on syntactic relationships, while another could concentrate on semantic connections.
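
To make these two mechanisms concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with a simple multi-head split. The toy embeddings, dimensions, and random projection matrices are illustrative assumptions, not BERT's actual weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head.

    x: (seq_len, d_model) word embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of each query to every key
    weights = softmax(scores, axis=-1)       # attention scores sum to 1 for each query
    return weights @ v                       # weighted sum of values

# Toy setup: 6 tokens ("The cat sat on the mat"), d_model = 8, 2 heads
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(x, w_q, w_k, w_v))

# Multi-head attention: run the heads in parallel and concatenate their outputs
output = np.concatenate(heads, axis=-1)
print(output.shape)  # (6, 8)
```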

Pretraining and fine-tuning

Pretraining involves two specific tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

1. Masked Language Modeling (MLM)

In the MLM task, BERT learns to predict missing words within a sentence. This is achieved by masking (hiding) certain words and having the model predict them based on the unmasked context. To do this, a certain percentage of the words in the input text are replaced with the [MASK] token. Because BERT sees the entire sentence while predicting words that may sit anywhere in it, it learns to capture the bidirectional context of each word.
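
As a quick illustration, the sketch below uses the Hugging Face transformers library (an external dependency assumed here, not part of this topic) to let a pretrained BERT fill in a [MASK] token:

```python
from transformers import pipeline

# The fill-mask pipeline loads a pretrained BERT together with its MLM head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden word from both the left and the right context
for prediction in fill_mask("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```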

2. Next Sentence Prediction (NSP)

The NSP task aims to enable BERT to understand relationships between pairs of sentences and predict whether one sentence logically follows the other. During NSP training, BERT receives pairs of sentences (A, B): in half of the pairs, sentence B actually follows sentence A in the source text, while in the other half, B is a randomly chosen sentence, and the model must classify which case it is. NSP training allows BERT to grasp the flow of ideas between sentences and the structure of texts.
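
For illustration, here is a small sketch using BertForNextSentencePrediction from the Hugging Face transformers library (again an assumed external dependency); the pretrained NSP head scores whether sentence B follows sentence A:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened the fridge."
sentence_b = "It was completely empty."

# The tokenizer inserts [CLS] and [SEP] and builds segment ids for A and B
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# For this head, index 0 corresponds to "B follows A" and index 1 to "B is random"
probs = torch.softmax(logits, dim=-1)
print(f"P(B follows A) = {probs[0, 0].item():.3f}")
```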

Fine-tuning involves adapting BERT to specific NLP tasks using task-specific datasets. The first step in fine-tuning is to modify the output layer of the pretrained BERT model by adding a task-specific layer, such as a classification layer for sentiment analysis or a sequence labeling layer for named entity recognition. During fine-tuning, you can choose which layers of the BERT model to update. Often, the early layers are kept frozen, while the task-specific and some of the later layers are updated. This strategy preserves the general knowledge BERT has acquired while adapting it to the specific task.
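
A minimal fine-tuning sketch, assuming the Hugging Face transformers library and a binary sentiment task; the dataset and training loop are omitted, and freezing the embeddings plus the first eight encoder layers is just one illustrative strategy:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Adds a randomly initialized classification layer on top of the pretrained encoder
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the early encoder layers; fine-tune the rest
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# Batches tokenized like this one would then go through a standard training loop
batch = tokenizer(["I loved this film!", "Terrible plot."],
                  padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # (2, 2): one logit per class for each example
```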

Tokenization

BERT employs a tokenization technique known as WordPiece tokenization. This method breaks down words into smaller subword units, or tokens, based on their likelihood of appearing together in the data. To better handle sequences, some special tokens are used:

  • [CLS]: Added at the beginning of every input sequence; its final hidden state serves as the aggregate sequence representation for classification tasks.

  • [SEP]: Placed at the end of every sequence and between sentences A and B when the input is a sentence pair, as in the Next Sentence Prediction task.

  • [MASK]: Replaces randomly selected tokens in the sequence during the MLM pretraining task.

  • [PAD]: Added either at the beginning or the end of sequences to make them of equal length during batch training, which is necessary for efficient parallel processing.
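
The sketch below, again assuming the Hugging Face transformers tokenizer for bert-base-uncased, shows WordPiece splitting a rare word into subwords and the special tokens added around a sentence pair:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits a rare word into subword units; continuation pieces start with "##"
print(tokenizer.tokenize("unfathomability"))

# Encoding a sentence pair adds [CLS] at the start and [SEP] after each sentence
pair_ids = tokenizer("The cat sat.", "It was on the mat.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(pair_ids))

# Padding a batch appends [PAD] tokens so all sequences share the same length
batch = tokenizer(["Short.", "A somewhat longer sentence."], padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
```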

Advantages and disadvantages

Although BERT is considered a state-of-the-art model for many tasks, it has both strengths and limitations:

Advantages:

  1. Bidirectional contextual understanding: BERT considers both the left and right context of each word, capturing more comprehensive context and relationships between words in a sentence and helping it understand nuances, word meanings, and linguistic dependencies.

  2. Transfer learning: BERT's pretrained representations can be fine-tuned for various NLP tasks, reducing the need to train models from scratch.

  3. Fine-grained features: Thanks to subword tokenization, BERT can handle rare words, out-of-vocabulary terms, and morphologically rich languages.

Disadvantages:

  1. Limited interpretability: Understanding the learned representations within BERT is challenging, which can limit its use in safety-critical applications.

  2. Limited understanding of world knowledge: BERT's pretraining doesn't incorporate external knowledge bases, limiting its understanding to the text on which it was pretrained.

  3. Static masking strategy: The first version of BERT used static masking: the masked positions were chosen once, during data preprocessing, and the dataset was duplicated n times so that each sequence appeared with a fixed set of masks. As a result, the model encounters the same masked words across epochs while some words may never be masked at all, which limits what it can learn about the relationships between different words in a sequence.

BERT variations and extensions

Several variations of BERT aim to address its limitations or enhance specific aspects:

1. RoBERTa (A Robustly Optimized BERT Pretraining Approach): RoBERTa uses larger batch sizes and longer training while omitting the NSP task and replacing BERT's static masking with dynamic masking in the MLM objective.

2. ALBERT (A Lite BERT for Self-supervised Learning of Language Representations): This model uses parameter-sharing techniques to reduce model size and enhance efficiency. ALBERT shares parameters across layers and factorizes the large vocabulary embedding matrix into a lower-dimensional space.

3. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately): This variant proposes a novel pretraining objective called "replaced token detection," as opposed to masked token prediction. ELECTRA trains a generator to replace words in the input text and a discriminator to distinguish between original and replaced tokens.

4. DistilBERT (a distilled version of BERT): Focused on model compression and speed, DistilBERT uses a teacher-student framework (knowledge distillation) in which a larger BERT model (the teacher) guides the training of a smaller model (the student). The student learns to mimic the teacher's behavior, resulting in a smaller, faster model that retains most of the teacher's performance.
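
As a hedged illustration, assuming the Hugging Face transformers library and the public checkpoint names on its model hub, these variants can all be loaded through the same Auto classes and compared by size:

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint names assume the publicly hosted models on the Hugging Face Hub
for checkpoint in ["bert-base-uncased", "roberta-base",
                   "albert-base-v2", "distilbert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.0f}M parameters")
```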

Conclusion

BERT has revolutionized the field of NLP with its transformer-based architecture, bidirectional capabilities, and pretraining-plus-fine-tuning approach. You can adapt a pretrained BERT model to nearly any NLP task using a domain-specific dataset.
