GRU

Recurrent Neural Networks (RNNs) serve as a foundation for understanding sequential data, setting the stage for more advanced models that address their inherent limitations. While Long Short-Term Memory (LSTM) models have offered significant improvements, especially in dealing with the long-term dependency problem, the quest for efficiency has led to the development of the Gated Recurrent Unit (GRU).

In this topic, you'll learn about the GRU and its approach to processing sequential data. We will examine the architecture of GRU, focusing on its design that retains the ability to manage long-term dependencies with fewer parameters than LSTM. This exploration includes a detailed look at the roles of the reset and update gates, and the formulation of the hidden state, providing insights into why and how GRUs offer an effective alternative for certain applications.

The evolution of GRU

Traditional RNNs struggle with long-term dependencies, meaning they find it hard to maintain information from earlier inputs for use in later steps. This issue, known as the vanishing gradient problem, makes it difficult for RNNs to learn from data where the relevant information is separated by large gaps.

LSTMs were introduced to solve this problem by incorporating mechanisms like input, forget, and output gates. These gates help the model decide which information to store, discard, or pass through, enabling it to retain information over longer sequences. However, the complexity of LSTMs comes from these very mechanisms: they require a significant number of parameters and computational resources to manage the gates' operations.

GRUs, developed by Kyunghyun Cho et al. in 2014, were designed to simplify this architecture by merging the input and forget gates into a single update gate and combining the cell state and hidden state. This results in a more streamlined model that can capture dependencies over long sequences with fewer parameters and less computational overhead compared to LSTMs.

Reset gate and update gate

GRU employs two distinct gates to manage information throughout the sequence: the reset gate and the update gate. These gates function within the sigmoid activation range (0 to 1), shaping the neural network's memory and learning processes. They are essential for the GRU's ability to discern which parts of the past data are relevant to retain for future computations.

The reset gate, defined by the formula

R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)

where \sigma denotes the sigmoid function, determines the worth of the previous hidden state for the current state's computation. The gate regulates the influence of past hidden states by applying weights W_{xr} and W_{hr} to the current input X_t and the previous hidden state H_{t-1}, respectively, with an added bias term b_r. The output of the reset gate directs the formation of the candidate hidden state, allowing the network to discard non-essential historical information.

Conversely, the update gate, defined by

Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)

integrates new information with the existing hidden state. It processes the current input and the past hidden state through its weights W_{xz} and W_{hz}, along with a bias b_z. Its primary task is to balance the old hidden state against the newly proposed state, enabling the GRU to maintain valuable information or adapt to new data as necessary.

This dual-gate mechanism allows the GRU to simplify the memory updating process compared to the LSTM, which uses a more complex system of three gates. By effectively controlling the flow of historical and new information, the GRU presents a more parameter-efficient option for tasks that require the modeling of sequential dependencies.
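
The sketch below shows the two gate equations as NumPy code. It is a minimal illustration rather than a full GRU implementation: the dimensions, the random weight initialization, and the variable names (W_xr, W_hr, H_prev, and so on) are assumptions chosen to mirror the formulas above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)

# Weights and biases for the reset gate (R_t) and update gate (Z_t)
W_xr = rng.normal(size=(input_size, hidden_size))   # input -> reset gate
W_hr = rng.normal(size=(hidden_size, hidden_size))  # hidden -> reset gate
b_r = np.zeros(hidden_size)
W_xz = rng.normal(size=(input_size, hidden_size))   # input -> update gate
W_hz = rng.normal(size=(hidden_size, hidden_size))  # hidden -> update gate
b_z = np.zeros(hidden_size)

X_t = rng.normal(size=(1, input_size))   # current input X_t
H_prev = np.zeros((1, hidden_size))      # previous hidden state H_{t-1}

R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)  # reset gate, entries in (0, 1)
Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)  # update gate, entries in (0, 1)

Because both gates pass through the sigmoid, every entry of R_t and Z_t lies between 0 and 1, which is what lets them act as soft switches over the hidden state.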

Candidate hidden state

The candidate hidden state within a GRU is a temporary value that suggests how the unit's memory might be updated at each step in the sequence. To calculate this value, the GRU combines the information from the current input with the adjusted previous hidden state, influenced by the reset gate. This computation determines how much past information the GRU will carry forward.

Mathematically, the candidate hidden state, represented as \tilde{H}_t, is calculated using the formula:

\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)

Here, X_t signifies the input at the current time step, W_{xh} and W_{hh} are the weights connected to the input and the previous hidden state, b_h is the bias term, and \odot stands for the Hadamard product, or element-wise multiplication. The reset gate R_t uses a sigmoid activation to control the extent to which the previous hidden state H_{t-1} should contribute to the candidate hidden state.

The tanh activation function then transforms this result into a value that falls within a -1 to 1 range, allowing for a controlled adjustment to the state that accounts for both nonlinear characteristics of the input data and the need to keep gradients in check during learning.

The candidate hidden state is termed 'candidate' because it is not the final update to the unit's memory. It is subject to the update gate's assessment, which will determine how much of this candidate state will be used to update the GRU's memory to the new hidden state HtH_t. The GRU's design, particularly the interaction between the reset gate, the update gate, and the candidate hidden state, enables it to flexibly handle varying lengths of data sequences, determining at each step what to remember and what to discard.
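
Continuing the hypothetical NumPy sketch from the previous section (reusing rng, X_t, H_prev, and R_t), the candidate hidden state can be computed as follows; W_xh, W_hh, and b_h are additional assumed parameters.

W_xh = rng.normal(size=(input_size, hidden_size))   # input -> candidate state
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> candidate state
b_h = np.zeros(hidden_size)

# The reset gate scales H_{t-1} element-wise (Hadamard product) before it
# contributes to the candidate state; tanh keeps every entry in (-1, 1).
H_tilde = np.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)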

Hidden state

The hidden state in a GRU reflects the memory of the network at a particular time step, combining information carried over from previous steps with the current input. It is updated by the output of the update gate, which determines how closely the new state should resemble the previous state or the candidate hidden state. The update gate thus enables the network to decide how much past information to carry forward and how much to integrate from the present.

The update equation for the hidden state H_t is as follows:

H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t

Here, Z_t is the update gate's output, \odot is the element-wise multiplication, H_{t-1} is the previous hidden state, and \tilde{H}_t is the candidate hidden state. When Z_t is close to 1, the GRU tends to keep the old state, largely disregarding the current input. Conversely, when Z_t is close to 0, the new hidden state H_t is formed more heavily from the candidate hidden state, a significant update that allows new information to be incorporated.

The resulting hidden state H_t is an element-wise convex combination of H_{t-1} and \tilde{H}_t, providing a balanced mix of historical and current data. This balance ensures that the GRU can adapt its memory content to reflect relevant information at each step in a sequence. This flexibility allows the update gate to address both short-term and long-term dependencies in the data.
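
To close the loop on the sketch, the final update combines the pieces from the previous sections. The gru_step helper below is a hypothetical packaging of all three equations into one function, not an official implementation; in practice you would typically rely on a framework module such as PyTorch's torch.nn.GRUCell, which applies the same kind of update internally.

# New hidden state: element-wise convex combination of H_{t-1} and the
# candidate state, weighted by the update gate Z_t.
H_t = Z_t * H_prev + (1 - Z_t) * H_tilde

# The three equations packaged as one GRU step (parameters passed in a dict).
def gru_step(X_t, H_prev, p):
    R_t = sigmoid(X_t @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])
    Z_t = sigmoid(X_t @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])
    H_tilde = np.tanh(X_t @ p["W_xh"] + (R_t * H_prev) @ p["W_hh"] + p["b_h"])
    return Z_t * H_prev + (1 - Z_t) * H_tilde

Calling gru_step repeatedly over the time steps of a sequence, feeding each returned H_t back in as H_prev, is all it takes to run the recurrence forward.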

Conclusion

The distinguishing features of Gated Recurrent Units (GRUs) lie in their two types of gates, which are instrumental in processing sequences. The reset gates are adept at capturing short-term dependencies, allowing the network to dynamically forget or remember information at each time step. On the other hand, the update gates are key to retaining long-term dependencies, giving the GRU the ability to carry forward relevant information through longer sequences. Together, these mechanisms allow GRUs to effectively manage the flow of information through time, making them a valuable tool for sequential data analysis and prediction.
