Long Short-Term Memory (LSTM) networks are a specialized kind of Recurrent Neural Network (RNN) designed to overcome challenges related to learning long-term dependencies. Traditional RNNs often struggle to remember information for extended periods due to the vanishing gradient problem, a limitation that LSTMs effectively address.
In this topic, you'll learn about the core components and functionality of LSTM networks. We will explore the structure of LSTM units, including their gates and cell states, and how these elements work together to retain and process information over extended sequences.
LSTM gating mechanism
Regular RNNs store long-term information in their weights. These weights change slowly during training, capturing general knowledge about the data. For short-term information, RNNs rely on activations, which are temporary and move from one node to another. However, these activations don't last long and can be lost, making it hard for RNNs to remember information over longer sequences.
LSTMs address this issue with a unique component called the memory cell. This cell is a complex arrangement of simpler nodes, including some that multiply values, allowing it to store information more effectively. Each memory cell in an LSTM has its own internal state and gates that control different functions:
- Input Gate: Decides how much new information should be added to the cell's internal state.
- Forget Gate: Determines whether to remove or retain information from the cell's state.
- Output Gate: Controls whether the cell's internal state should affect its output.
The LSTM network takes in data at each time step, along with the hidden state from the previous step. The gates within the LSTM utilize sigmoid activation functions to compute their values, ensuring that these values are between 0 and 1.
The input gate $i_t$ regulates the extent to which new information is allowed into the memory cell. It combines the current input $x_t$ with the previous hidden state $h_{t-1}$ through the weight matrices $W_{xi}$ and $W_{hi}$ and a bias term $b_i$, as shown in the equation:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
The forget gate $f_t$ determines whether to retain or discard information from the memory cell's state. It combines the current input $x_t$ with the previous hidden state $h_{t-1}$ through the weight matrices $W_{xf}$ and $W_{hf}$ and a bias term $b_f$, as shown in the equation:

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$

This allows the cell to forget or remember information based on the current context and the cell's previous state.
Finally, the output gate $o_t$ influences whether the current state of the memory cell affects the output at that time step. It combines the current input $x_t$ with the previous hidden state $h_{t-1}$ through the weight matrices $W_{xo}$ and $W_{ho}$ and a bias term $b_o$, as shown in the equation:

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$

The resulting value is used to filter the internal state before it contributes to the output.
These gates give LSTMs a significant advantage over standard RNNs. They allow LSTMs to make precise decisions about when to update or reset their hidden state based on the data they process. For instance, if an early piece of data is important, the LSTM learns to keep its hidden state unchanged for a while. It also learns to ignore irrelevant data points and reset the hidden state when needed.
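To make the gate computations concrete, here is a minimal NumPy sketch of the three gate activations for a single time step. The parameter names (`W_xi`, `b_i`, and so on) and the `lstm_gates` helper are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, p):
    """Compute the input, forget, and output gate activations for one time step.

    x_t    : current input vector
    h_prev : hidden state from the previous time step
    p      : dict of weight matrices and biases (illustrative names)
    """
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])  # output gate
    return i_t, f_t, o_t  # each value lies in (0, 1) thanks to the sigmoid
```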
Input node functionality
The design of memory cells in an LSTM includes an element called the input node. The input node, sometimes referred to as the candidate cell state, actually generates the new information that could be added to the memory cell state. It computes a vector of new candidate values based on the current input and the previous hidden state, typically using a $\tanh$ activation function, which produces values between -1 and 1. The input gate then takes these candidate values produced by the input node and decides how much of them should be used to update the memory cell state. The input node's role is therefore not to decide but to actively create the new values that could potentially be added to the memory cell.
At any given time step $t$, the input node $\tilde{c}_t$ is calculated using the current input $x_t$, the previous hidden state $h_{t-1}$, the weight matrices $W_{xc}$ and $W_{hc}$, and a bias term $b_c$. Specifically, the formula is:

$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

The result is a vector of candidate values that reflects both the new input and the context carried over in the previous hidden state.
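Continuing the NumPy sketch from above, the input node is one more small function (again with illustrative parameter names):

```python
def input_node(x_t, h_prev, p):
    """Candidate cell state: new values that may be written into the memory cell."""
    # tanh keeps the candidate values in the range (-1, 1)
    return np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
```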
By carefully managing its internal state through the input node and gates, the LSTM can maintain important information over time and sequences, allowing it to make predictions or decisions based on a complex mixture of new and previously received information.
Memory cell state mechanism
The memory cell state in an LSTM is updated through a careful balance of adding new information and removing outdated information. This is managed by two gates: the input gate $i_t$ and the forget gate $f_t$. The input gate controls how much new information is added to the cell state, while the forget gate determines how much of the old state is retained.
The update of the cell state at each time step involves two key operations:
- The forget gate's output decides which parts of the previous cell state should be kept or discarded.
- The input gate's output filters the new candidate values that may be added to the state.
The cell state $c_t$ can be expressed with the equation:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Here, the Hadamard product $\odot$ represents element-wise multiplication, which is used to multiply the forget gate's output with the previous cell state $c_{t-1}$ and the input gate's output with the new candidate values $\tilde{c}_t$. These two results are then summed to update the cell state to $c_t$.
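In the NumPy sketch, this update is a couple of element-wise operations; the gate and candidate values are assumed to come from the helper functions defined earlier:

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """c_t = f_t * c_{t-1} + i_t * c_tilde, all element-wise."""
    return f_t * c_prev + i_t * c_tilde  # '*' is element-wise on NumPy arrays
```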
The adjustable nature of the input and forget gates gives the LSTM the ability to learn the optimal timing for updating the cell state in response to new inputs.
Role of the hidden state
The hidden state $h_t$ is the output of the LSTM at each time step, and it is shaped by the output gate $o_t$ and the memory cell state $c_t$. To compute the hidden state, the memory cell state is first squashed by the hyperbolic tangent function, which scales its values to fall between -1 and 1. The corresponding equation is:

$$h_t = o_t \odot \tanh(c_t)$$
In this relationship, the output gate's role is to filter the information from the memory cell state that will contribute to the hidden state. The Hadamard product between the output of the gate and the normalized cell state ensures that only pertinent information influences the hidden state. If the output gate is close to 1, it permits a significant portion of the memory cell state to influence the hidden state. If the output gate is close to 0, it limits the influence of the memory cell state on the hidden state, preventing unnecessary or outdated information from passing through.
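In the running NumPy sketch, this final step is a single line (assuming `o_t` and `c_t` come from the earlier functions):

```python
def compute_hidden_state(o_t, c_t):
    """h_t = o_t * tanh(c_t): the output gate filters the squashed cell state."""
    return o_t * np.tanh(c_t)  # element-wise product keeps only the permitted parts
```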
This selective passing of information allows the LSTM to maintain information across many time steps, releasing it when appropriate for the task at hand. This feature is essential for tasks that require careful timing of information output, such as in sequence prediction and language processing tasks.
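Putting the pieces of the sketch together, a full forward pass for one time step might look like this; it is a simplified illustration composed from the hypothetical helpers above, not a production implementation:

```python
def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step, composed from the helper functions above."""
    i_t, f_t, o_t = lstm_gates(x_t, h_prev, p)   # gate activations in (0, 1)
    c_tilde = input_node(x_t, h_prev, p)         # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde           # update the memory cell state
    h_t = o_t * np.tanh(c_t)                     # filtered output / hidden state
    return h_t, c_t
```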
Conclusion
LSTM networks are designed to process and remember information over long sequences, which is vital for tasks that involve time-related data. Here's a streamlined recap of their functionality:
- LSTMs are equipped with a memory cell that carries relevant information through sequential steps, capturing necessary long-term data.
- A set of gates regulates the memory cell's information. The input gate adds new, relevant information. The forget gate removes unnecessary information. The output gate filters the memory cell's information to produce the output.
- This system of gates helps LSTMs overcome issues like the vanishing gradient problem, making them effective for complex tasks like language translation and stock market prediction.
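In practice, you rarely implement these equations by hand; deep learning frameworks ship ready-made LSTM layers. As an illustration, here is a small PyTorch example (the layer sizes and tensor shapes are arbitrary choices for the demo):

```python
import torch
import torch.nn as nn

# A one-layer LSTM: 16-dimensional inputs, 32-dimensional hidden and cell states.
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=1, batch_first=True)

x = torch.randn(4, 10, 16)       # batch of 4 sequences, 10 time steps each
output, (h_n, c_n) = lstm(x)     # initial hidden/cell states default to zeros

print(output.shape)  # torch.Size([4, 10, 32]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 4, 32])  - final hidden state
print(c_n.shape)     # torch.Size([1, 4, 32])  - final memory cell state
```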