
Language models before Transformers


In this topic, you are going to learn about language models that were commonly used before the development of the Transformer architecture. While contemporary language models like BERT or GPT are widely employed nowadays, it is still possible to construct simpler architectures suitable for less resource-intensive tasks.

Unigram language model

One such model is the Unigram language model, which belongs to the category of count-based language models along with N-Gram LM and Exponential LM.

In a Unigram LM, sentence generation occurs token by token. To calculate the probabilities for this model, you can use the following formula:

P_{uni}(t_1 t_2 t_3) = P(t_1)P(t_2)P(t_3)

The sum of all token probabilities P over the vocabulary should be 1.

Such models generate sentences by first selecting a word, such as "Peter," either randomly or based on input data. Assuming we are generating a text about a lawyer, we then calculate the probability of "Peter" occurring in the text: P(\text{Peter}). We repeat this process for each subsequent word, such as "is": P(\text{is}), and so on for the entire sentence. Finally, we obtain the sentence "Peter is a good lawyer."

To estimate how likely these tokens are to appear in this exact order, we multiply all the probabilities:

P(\text{Peter is a good lawyer}) = P(\text{Peter}) \cdot P(\text{is}) \cdot P(\text{a}) \cdot P(\text{good}) \cdot P(\text{lawyer})
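
As a quick illustration, here is a minimal sketch of a unigram LM estimated from a tiny toy corpus; the corpus and helper names are invented for this example:

```python
from collections import Counter
import math

# Toy corpus for illustration only.
corpus = "peter is a good lawyer . he is a partner at a law firm .".split()

counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(token):
    """P(token) estimated as count(token) / total number of tokens."""
    return counts[token] / total

def sentence_log_prob(tokens):
    """log P(t1 t2 ... tn) = sum of log P(ti); log-space avoids underflow."""
    return sum(math.log(unigram_prob(t)) for t in tokens)

print(sentence_log_prob("peter is a good lawyer".split()))
```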

N-gram language model

An N-gram LM is a more complicated model. In an N-gram model, the probability P(w_1, \ldots, w_m) of observing the sentence w_1, \ldots, w_m is approximated as:

P(w_1, \ldots, w_m) = \prod_{i=1}^m P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=2}^m P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})

Here, n is given by the model specification: for a bigram, n = 2; for a trigram, n = 3; and so on. We can predict tokens the same way as in the Unigram LM, except that each token is now conditioned on the previous n - 1 tokens.
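
For example, under a bigram approximation (n = 2), the sentence from the previous section factorizes as:

P(\text{Peter is a good lawyer}) \approx P(\text{Peter}) \cdot P(\text{is} \mid \text{Peter}) \cdot P(\text{a} \mid \text{is}) \cdot P(\text{good} \mid \text{a}) \cdot P(\text{lawyer} \mid \text{good})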

This model has many subtypes: Katz's back-off model and simple smoothing methods like Add-one smoothing, Add-delta smoothing, Interpolated smoothing, and others.
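
Here is a small sketch of a bigram LM with add-one (Laplace) smoothing; the toy corpus and function names are illustrative only:

```python
from collections import Counter

# Toy corpus for illustration only.
corpus = "peter is a good lawyer . peter is a partner .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing:
    (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# A seen bigram gets a higher probability than an unseen one,
# but the unseen one is no longer zero.
print(bigram_prob("is", "a"))       # seen in the corpus
print(bigram_prob("is", "lawyer"))  # unseen, still non-zero
```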

Exponential language model

Exponential LMs are also sometimes called maximum entropy LMs. These models use the following formula to compute the conditional probability of a token w_i given a context h_i:

P(w_i \mid h_i) = \frac{1}{Z(h_i)} \exp\left(\sum_j \lambda_j f_j(h_i, w_i)\right)

In the above formula, \lambda_j are the parameters, f_j(h_i, w_i) are arbitrary feature functions of the pair (h_i, w_i), and Z(h_i) is a normalization factor:

Z(h) = \sum_{w \in V} \exp\left(\sum_j \lambda_j f_j(h, w)\right)

The parameters can be obtained from the training data based on the maximum entropy principle. In short, using the principle of maximum entropy and some testable information (for example, the mean), you can find the distribution that makes the fewest assumptions about your data (the one with maximal information entropy). Outside of NLP, this principle is often used in Bayesian inference to determine prior distributions.

Most neural network LMs use a softmax output layer and can be considered exponential LMs.
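
As a rough illustration (not a full maximum entropy trainer), the sketch below assumes a tiny vocabulary, two hand-made feature functions, and fixed weights, and computes P(w | h) exactly as in the formula above: an exponential of a weighted feature sum normalized by Z(h):

```python
import math

# Hypothetical toy setup: a tiny vocabulary, two hand-crafted feature
# functions f_j(h, w), and fixed weights lambda_j chosen for illustration.
vocab = ["lawyer", "doctor", "cat"]

def f1(history, word):
    # Fires if the history mentions "law" and the word is "lawyer".
    return 1.0 if "law" in history and word == "lawyer" else 0.0

def f2(history, word):
    # Fires if the word ends in "-er" (a crude morphological feature).
    return 1.0 if word.endswith("er") else 0.0

features = [f1, f2]
lambdas = [2.0, 0.5]  # assumed weights; in practice learned by maximum entropy

def score(history, word):
    """The unnormalized log-score: sum_j lambda_j * f_j(h, w)."""
    return sum(l * f(history, word) for l, f in zip(lambdas, features))

def maxent_prob(history, word):
    """P(w | h) = exp(score(h, w)) / Z(h), where Z(h) sums over the vocabulary."""
    z = sum(math.exp(score(history, v)) for v in vocab)
    return math.exp(score(history, word)) / z

print(maxent_prob("peter studied law and became a", "lawyer"))
```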

Feed-forward neural probabilistic models

Neural network LMs are continuous-space LMs and can be divided into two groups: feed-forward neural probabilistic LMs (NPLM) and recurrent neural networks (RNN).

First, let's explore the feed-forward neural probabilistic LM. This model utilizes a three-layer feed-forward neural network to learn the parameters of the conditional probability distribution of the next word, given the previous n-1 words. For more details, refer to the Language Model: A Survey of the State-of-the-Art Technology article on SyncedReview.

[Image: scheme of the feed-forward neural probabilistic LM]

In this model, we first build a mapping C from each token i of the vocabulary V to a distributed, real-valued feature vector C(i) \in \mathbb{R}^m, where m is the number of features. C is a |V| \times m matrix whose row i is the feature vector C(i) for token i. Then, a function g maps the input sequence of feature vectors for the context tokens (C(w_{t-n+1}), \ldots, C(w_{t-1})) to a conditional probability distribution over tokens in V for the next token w_t. In the final stage, we learn the token feature vectors and the parameters of that probability function with a composite function f, comprised of the two mappings C and g, with the following formula:

f(i, w_{t-1}, \ldots, w_{t-n+1}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n+1}))

This neural network approach can mitigate the data sparsity problem and generalizes better than n-gram models in terms of perplexity.
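
A compact PyTorch sketch of such a feed-forward model could look as follows; the layer sizes and names are illustrative, not taken from the original paper:

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """A minimal feed-forward neural probabilistic LM sketch:
    embed the previous n-1 tokens, concatenate, and predict the next token."""

    def __init__(self, vocab_size, emb_dim=64, context_size=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # the mapping C
        self.hidden = nn.Linear(context_size * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)              # the function g

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous tokens
        emb = self.embed(context_ids)            # (batch, context_size, emb_dim)
        x = emb.view(emb.size(0), -1)            # concatenate the feature vectors
        h = torch.tanh(self.hidden(x))
        return self.out(h)                       # logits over the vocabulary

model = NPLM(vocab_size=1000)
dummy_context = torch.randint(0, 1000, (2, 3))   # batch of 2, n-1 = 3 context tokens
logits = model(dummy_context)
probs = torch.softmax(logits, dim=-1)            # conditional distribution over V
print(probs.shape)                               # torch.Size([2, 1000])
```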

Recurrent neural networks

The recurrent neural network (RNN) is another type of neural model that goes beyond just being a language model (LM). RNNs are versatile and used in various deep-learning tasks.

Unlike the previous neural network type, RNNs do not have a limited context size. By incorporating recurrent connections, information can circulate within these networks for an extended period. RNNs encompass different variations, including LSTM, GRU, and vanilla RNN.

Take a look at the image below, which depicts a simple one-layer RNN architecture. You can find more information in the "Models: Recurrent" section of the Language Modeling article by Lena Voita.

[Image: scheme of a one-layer RNN LM]

At each step, the current state contains information about the previous tokens, and you can use it to predict the next token. During training, you feed in the tokens of the training examples; at inference, you feed in the tokens your model has already generated as context, usually until the eos (end-of-sequence) token is generated.

In addition, there are multi-layer RNN models. In such models, inputs for higher layers are representations derived from the previous layer. The underlying hypothesis is that with multiple layers, lower layers can capture local patterns, while higher layers are better equipped to capture longer dependencies. Here is an example of a 2-layer RNN LM:

[Image: scheme of a 2-layer RNN LM]
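
A possible PyTorch sketch of such a 2-layer RNN LM is shown below; the hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """A small RNN language model sketch: embeddings -> stacked LSTM -> logits."""

    def __init__(self, vocab_size, emb_dim=64, hidden=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # num_layers=2 gives the stacked (2-layer) variant from the figure above;
        # the second layer consumes the hidden states produced by the first.
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len); state carries information across steps
        emb = self.embed(token_ids)
        output, state = self.rnn(emb, state)     # output: (batch, seq_len, hidden)
        return self.out(output), state           # next-token logits at each step

model = RNNLM(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 5))          # batch of 2 sequences of length 5
logits, state = model(tokens)
print(logits.shape)                              # torch.Size([2, 5, 1000])
```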

Conclusion

In conclusion, language models have evolved over time, with different architectures in use before the emergence of Transformers. Unigram, N-gram, and exponential language models were prevalent approaches to language modeling and can still be used for simple tasks when computational resources are limited.
