
We use language models every day. When you type a sentence on your smartphone or tablet and the device offers you different sentence endings, it is the language model, not the device, that predicts what you want to write. Language models are used in NLP to solve tasks such as machine translation, text summarization, speech recognition, and natural language generation.

Language model concepts

As with any other model, language models need context. Let's take an abstract example: you wish to predict the future using a dataset of all human history. You would like to know what will happen in the next two years. Your model will try to find similar contexts (similar situations) in the history dataset and will provide you with an output based on the most probable events of the past. But that's not enough: a good model would simulate the behavior of the real world; it would "understand" which events are more likely.

And so does a language model (LM). It is a probability distribution over sequences of words, tokens, characters, and so on. If we have two sentences, Peter is a good lawyer and the is Peter lawyer a good, a language model will decide which of them is more likely to occur in a text: in our case, the first sentence will get a higher score and will be shown as the output.

Language models solve two major tasks:

  1. They predict the probability of a whole sequence;

  2. They predict the probability of the next word's occurrence.

We define two LM types:

  • Non-neural LMs are traditional statistical models. They usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing. Count-based models include N-gram and Exponential models;

  • Neural LMs:

    • The feed-forward neural probabilistic LM (NPLM) solves the data sparsity problem of the n-gram model by representing words as vectors (word embeddings) and using them as inputs to the network;

    • The recurrent neural network (RNN) doesn't use a limited context, unlike NPLMs;

    • Transformers.

There are also deterministic LMs, which were in the spotlight long ago. We can view a finite-state automaton as a basic deterministic model. Its principle can be described by the picture below:

Two circles with the words "I" and "wish" are shown here, from which arrows go to each other

This model can produce the iteration I wish I wish I wish..., but not wish I wish I... or I wish I. As you would expect, such models show poor results. Later, they were displaced by probabilistic models.
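For illustration, here is a minimal Python sketch of this two-state automaton; the transition table simply mirrors the picture above.

```python
# A tiny deterministic "LM": from "I" you can only go to "wish", and back again.
transitions = {"I": "wish", "wish": "I"}

def generate(start="I", steps=6):
    word, output = start, []
    for _ in range(steps):
        output.append(word)
        word = transitions[word]   # exactly one allowed continuation, no probabilities involved
    return " ".join(output)

print(generate())                  # I wish I wish I wish
```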

We cannot reliably estimate sentence probabilities if we treat sentences as atomic units. Instead, let's decompose the probability of a sentence into the probabilities of its smaller parts; then we can predict it word by word. Another way is to predict it n-gram by n-gram.

N-gram language model

Unigram LM, as well as N-Gram LM and Exponential LM, are count-based LMs.

A unigram LM implies that we predict a sentence (in other words, generate a sentence) token by token. We can compute probabilities for this model with the following formula:

P_{uni}(t_1 t_2 t_3) = P(t_1) P(t_2) P(t_3)

The sum of all P over the vocabulary should be 1.

Such models generate a sentence in the following way: let's say we have the first word Peter, which could be generated randomly (if we have a text generation task without primary settings) or chosen for some good reason from the input data. We will assume that we generate a text about a lawyer. Then, we count the probability of Peter occurring in the text: P(\text{Peter}). Then, we count the probability of the second word occurring: P(\text{is}). And so on for the whole sentence. In the end, we get the sentence Peter is a good lawyer.

To estimate how likely the whole sentence is, we multiply all the token probabilities:

P(\text{Peter is a good lawyer}) = P(\text{Peter}) \cdot P(\text{is}) \cdot P(\text{a}) \cdot P(\text{good}) \cdot P(\text{lawyer})
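Here is a minimal Python sketch of such a unigram model; the toy corpus and the whitespace tokenization are made up purely for illustration.

```python
from collections import Counter

# A made-up toy corpus about a lawyer; real models are trained on millions of tokens.
corpus = "peter is a good lawyer . he is a partner at a good law firm .".split()
counts = Counter(corpus)
total = sum(counts.values())

def p(token):
    return counts[token] / total   # unigram probability; the values over the vocabulary sum to 1

def sentence_prob(sentence):
    prob = 1.0
    for token in sentence.split():
        prob *= p(token)           # multiply the independent token probabilities
    return prob

print(sentence_prob("peter is a good lawyer"))
```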

An N-gram LM is a more complicated model. In an N-gram model, the probability P(w_1, ..., w_m) of observing the sentence w_1, ..., w_m is approximated as:

P(w_1, ..., w_m) = \prod_{i=1}^m P(w_i | w_1, ..., w_{i-1}) \approx \prod_{i=2}^m P(w_i | w_{i-(n-1)}, ..., w_{i-1})

Here, n corresponds to our model specification: if it's a bigram model, then n = 2; if a trigram, then n = 3, and so on. We can predict n-gram by n-gram in the same way as we predicted tokens in the unigram LM.
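Below is a toy bigram (n = 2) sketch in Python; the <s> and </s> boundary tokens and the tiny training text are illustrative, and a real model would add smoothing to handle unseen bigrams.

```python
from collections import Counter

# Toy training text with sentence-boundary markers.
tokens = "<s> peter is a good lawyer </s> <s> peter is a partner </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p(word, prev):
    # Maximum-likelihood estimate of P(word | prev); no smoothing here.
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p(word, prev)      # chain rule with a first-order Markov assumption
    return prob

print(sentence_prob("peter is a good lawyer"))
```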

Exponential language model

Exponential LMs are also sometimes called maximum entropy LMs. These models use the following formula to compute the conditional probability of a token w_i given a context h_i:

P(w_i | h_i) = \frac{1}{Z(h_i)} \cdot \exp\left(\sum_j \lambda_j f_j(h_i, w_i)\right)

In the formula above, \lambda_j are the parameters, f_j(h_i, w_i) are arbitrary feature functions of the pair (h_i, w_i), and Z(h_i) is a normalization factor:

Z(h) = \sum_{w \in V} \exp\left(\sum_j \lambda_j f_j(h, w)\right)

The parameters can be obtained from the training data based on the maximum entropy principle. In short, using the principle of maximum entropy and some testable information (for example, the mean), you can find the distribution that makes the fewest assumptions about your data (the one with maximal information entropy). Outside NLP, this principle is often used in Bayesian inference to determine prior distributions.

Most neural network LMs use a softmax output layer and can be considered exponential LMs.
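To make the formula concrete, here is a toy Python sketch with two hand-written feature functions; the features, weights, and vocabulary are invented for illustration, whereas a real maximum entropy LM would learn the weights from data.

```python
import math

# Two illustrative binary feature functions over (history, word) pairs.
def f_after_article(h, w):
    return 1.0 if h and h[-1] in {"a", "the"} and w in {"lawyer", "partner"} else 0.0

def f_verb_after_name(h, w):
    return 1.0 if h and h[-1] == "peter" and w == "is" else 0.0

features = [f_after_article, f_verb_after_name]
lambdas = [1.2, 0.8]               # parameters; normally fit from training data
vocab = ["peter", "is", "a", "the", "good", "lawyer", "partner"]

def p(w, h):
    def score(word):
        return math.exp(sum(l * f(h, word) for l, f in zip(lambdas, features)))
    z = sum(score(word) for word in vocab)   # the normalization factor Z(h)
    return score(w) / z

print(p("lawyer", ["peter", "is", "a"]))
```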

Feed-forward neural probabilistic models

Neural network LMs are continuous-space LMs and can be divided into two groups: feed-forward neural probabilistic models (NPLM) and recurrent neural networks (RNN).

Let's start with a feed-forward neural probabilistic LM. It learns the parameters of the conditional probability distribution of the next word, given the previous n-1 words, using a feed-forward neural network of three layers. Here is an overview of this model (see the Language Model: A Survey of the State-of-the-Art Technology article on SyncedReview for reference):

There is a scheme of neural network

In this model, we first build a mapping C from each token i of the vocabulary V to a distributed, real-valued feature vector C(i) \in \mathbb{R}^m, where m is the number of features. C is a |V| \times m matrix whose row i is the feature vector C(i) for token i. Then, a function g maps the input sequence of feature vectors for the context tokens (C(w_{t-n+1}), ..., C(w_{t-1})) to a conditional probability distribution over tokens in V for the next token w_t. In the final stage, we learn the token feature vectors and the parameters of that probability function through a composite function f, comprised of the two mappings C and g, given by the following formula:

f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1}))

This neural network approach solves the data sparsity problem and generalizes better than n-gram models in terms of perplexity.
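Here is a rough PyTorch sketch of such a model; the class name, layer sizes, and context length are illustrative, and the training loop (cross-entropy over the softmax of the logits) is omitted.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Bengio-style feed-forward neural probabilistic LM (simplified sketch)."""
    def __init__(self, vocab_size, context_size=3, m=64, hidden=128):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)   # the mapping C: token id -> R^m
        self.g = nn.Sequential(                # the function g over the concatenated context vectors
            nn.Linear(context_size * m, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),     # one score per token in V
        )

    def forward(self, context_ids):            # context_ids: (batch, context_size) token ids
        e = self.C(context_ids)                # (batch, context_size, m)
        return self.g(e.flatten(1))            # logits; softmax turns them into P(w_t | context)
```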

Recurrent neural networks

Another neural model type is the recurrent neural network (RNN). RNNs are not just LMs; they are used in many other deep-learning tasks.

While the previous neural network type used a limited context size, this model does not. Thanks to recurrent connections, information can cycle inside these networks for an arbitrarily long time. By RNN, we also mean LSTM, GRU, and vanilla RNN.

Below is a picture of a simple one-layer RNN architecture (see the Models: Recurrent section of the Language Modeling article by Lena Voita):

There is a scheme of RNN model

At each step, the current state contains information about the previous tokens, and you can use it to predict the next token. During training, you feed the model the training examples. At inference, you feed in the tokens the model has already generated as the context; this usually continues until the end-of-sequence (eos) token is generated.

There are also multi-layer RNN models. In this case, inputs for higher RNNs are representations coming from the previous layer. The main hypothesis is that with several layers, lower layers will catch local phenomena, while higher layers will be able to catch longer dependencies. Here is an example of a 2-layer RNN LM:

There is a scheme of RNN model with 2 layers
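A minimal PyTorch sketch of such an RNN LM could look as follows; the sizes are illustrative, and num_layers=2 gives the 2-layer variant from the picture (num_layers=1 gives the simple one-layer model).

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal LSTM language model (illustrative sketch)."""
    def __init__(self, vocab_size, emb=64, hidden=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids, state=None):   # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        out, state = self.rnn(x, state)          # out: (batch, seq_len, hidden)
        return self.head(out), state             # next-token logits at every step
```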

Other neural network models

We should also mention convolutional neural network (CNN) models. CNNs are another family of neural models, commonly used in vision tasks; mobile devices have optimized GPUs and even specialized hardware to efficiently train and run them. CNN models include convolution layers that extract features from the data.

The CNN language model was first introduced in 2016 by Yann N. Dauphin's team. Their model achieved competitive results on the WikiText-103 and Google Billion Words benchmarks. The architecture stacks gated convolutional blocks, which benefit from parallel computation.
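A single gated convolutional block could be sketched roughly as follows; this is a simplified approximation of the idea, since the published architecture also uses residual connections, weight normalization, and many stacked blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """One gated convolutional (GLU) block, heavily simplified."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                                   # left-only padding keeps the convolution causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)   # 2x channels: value half + gate half

    def forward(self, x):                          # x: (batch, channels, seq_len)
        x = F.pad(x, (self.pad, 0))                # no access to future tokens
        return F.glu(self.conv(x), dim=1)          # GLU: first half * sigmoid(second half)
```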

Other types of neural LMs are transformers. Transformers are deep-learning models used primarily in NLP. Before transformers, most state-of-the-art NLP systems relied on gated RNNs. Transformers also make use of attention mechanisms but, unlike RNNs, do not have a recurrent structure. Given enough training data, attention mechanisms can match the performance of RNNs. The following models can be defined as transformers (see the usage sketch after this list):

  • GPT-3 is an autoregressive LM, which was introduced in 2020 by OpenAI. Its architecture is a standard transformer network;

  • BART was proposed in 2019 by Facebook AI Research. It uses a standard seq2seq/machine translation architecture with a bidirectional encoder and a left-to-right decoder (like GPT).
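As a quick usage example, here is how you could generate text with a pretrained autoregressive transformer via the Hugging Face transformers library; the gpt2 checkpoint stands in for GPT-3, which is only available through OpenAI's API.

```python
# Requires: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # downloads the pretrained GPT-2 checkpoint
print(generator("Peter is a good", max_length=10, num_return_sequences=1))
```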

Training language models

Training is the process of feeding a model initial data for further prediction. GPT-3 is a pre-trained model. You can always adapt a pre-trained model to your own tasks; this is called fine-tuning.

It's worth mentioning that all these models should be trained on a specific corpus. For machine translation tasks, it is helpful to use the UN multilingual parallel corpus. This corpus is available in the six official UN languages.

For text generation or text classification, you can use a Wikipedia corpus. The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia. In its present version (2022), it contains over 750 million words. You can also load a Wikipedia corpus with Gensim, as in the sketch below.
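For instance, Gensim's WikiCorpus class can stream article texts straight from a Wikipedia dump; the dump file name below is a placeholder for a file you would first download from dumps.wikimedia.org.

```python
# Requires: pip install gensim
from gensim.corpora.wikicorpus import WikiCorpus

# Passing dictionary={} skips building a gensim Dictionary, so we only stream the texts.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
for i, tokens in enumerate(wiki.get_texts()):   # each item is a list of tokens from one article
    print(tokens[:10])
    if i == 2:
        break
```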

For text summarization, you can use the Newsela dataset. It's a restricted dataset; you can get access to it by making a request. The Newsela dataset is a simplification corpus of 1,130 news articles, re-written by professional editors to meet readability standards for children at multiple grade levels.

Language model evaluation

A common way to evaluate a model is to compare its output against human-created benchmarks built from typical language-oriented tasks. Various datasets have been developed to evaluate language processing systems. These include the Stanford Sentiment Treebank, Quora Question Pairs, the Microsoft Research Paraphrase Corpus, GLUE, and SuperGLUE. Again, these benchmarks should be used for fine-tuning, not training.

But let's move on to metrics, an automated way to evaluate a model. The first one we will talk about is log-likelihood. Here is the formula to compute it:

L(y_{1:M}) = L(y_1, y_2, ..., y_M) = \sum_{t=1}^{M} \log_2 p(y_t | y_{<t})

Then, we can also try to measure perplexity:

\text{Perplexity}(y_{1:M}) = 2^{-\frac{1}{M}L(y_{1:M})}

A good model should have a high log-likelihood and low perplexity.
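Here is a small Python sketch that computes both metrics from a list of per-token model probabilities; the probability values are made up for illustration.

```python
import math

def log_likelihood(token_probs):
    # token_probs: the model's probabilities p(y_t | y_<t) for every token of the evaluation text
    return sum(math.log2(p) for p in token_probs)

def perplexity(token_probs):
    return 2 ** (-log_likelihood(token_probs) / len(token_probs))

probs = [0.2, 0.5, 0.1, 0.4]                # made-up probabilities for a 4-token text
print(log_likelihood(probs), perplexity(probs))
```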

The last metric is the word error rate (WER), which measures how much the predicted sequence of words differs from the actual sequence of words in the correct transcript.

\text{WER} = \frac{\text{Insertions} + \text{Deletions} + \text{Substitutions}}{\text{Total words}}
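Here is a small Python sketch that computes WER with the classic edit-distance dynamic program; the reference and hypothesis sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate over whitespace-tokenized word lists."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("peter is a good lawyer", "peter is good lawyers"))
```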

Conclusion

In this topic, we've discussed the essence of language models. In particular, we covered:

  • Non-neural language models, such as unigram, n-gram, and exponential models. We also briefly mentioned deterministic language models.

  • Neural language models, such as RNN-, LSTM-, and neural probabilistic ones.

  • Various methods of model evaluation, including word error rate, perplexity, and log-likelihood.

Now let's do some tasks!
