
Recurrent neural networks


When you read a book, with every new word, you remember the context set by the previous ones. This ability to remember and use past information to understand the present is fundamental in how we comprehend language and sequences. In the world of artificial intelligence, Recurrent Neural Networks (RNNs) mirror this capability, making them suitable for handling sequential data.

In this topic, you'll learn about the unique features of RNNs. We'll explore their structure and understand how they generate outputs. Additionally, we'll discuss the challenges associated with RNNs.

Understanding sequential data through RNNs

When we interact with the world, our experiences are not isolated; they are part of a continuous stream of events. This is the essence of sequential data—it's data where the order matters. RNNs are particularly adept at handling such data, and to understand their functionality, let's consider the weekly routine of an individual named David as an example.

David's activities

David has a simple yet structured weekly exercise routine. On the first day, he goes running in the gym. On the second day, he rides his bicycle. On the third day, he goes swimming. This sequence repeats unless the weather interferes; if it's a rainy day, David decides to stay in and sleep instead of following through with his exercise. When the weather clears up, he resumes his activity cycle from where he left off.

David's Routine

This routine can be modeled using an RNN. The RNN takes into account not just the current day's weather but also the sequence of David's activities leading up to that day. On the fourth day, if it rains, the network remembers that David was swimming the day before. When the sun comes out on the fifth day, the RNN uses this remembered information to predict that David will resume his cycle by going running in the gym, as that's the activity following swimming in his routine.

RNN on David's Routine

On the second day, a rainy day leads David to sleep, and the RNN notes this change. On the third day, the weather is sunny again, and the RNN, remembering that David slept the previous day due to rain and that he went running the day before that, predicts that he will ride his bicycle, which is the next activity in his sequence after running.

This example shows how RNNs use past information (the sequence of activities and weather conditions) to make informed predictions about future events. It's a simple yet effective demonstration of how RNNs handle sequential data, providing a foundation for understanding more complex applications such as language processing, where each word's context is dependent on the previous words in a sentence.

Understanding recurrent neural networks

Recurrent Neural Networks

One of the defining features of RNNs is their looping structure. This allows them to retain a form of memory, carrying information from one step of the network to the next. Each output in an RNN is influenced by the previous computations, which is similar to the network having a memory of what it has processed so far. In the above diagram, a segment of the neural network, labeled A, examines an input $x_t$ and generates an output $h_t$. A loop mechanism is integrated within this structure, enabling the transfer of information from one stage of the network to the subsequent one.

In our example with David's routine, the hidden state is what allows the RNN to 'remember' what David did on the previous days and predict what he will do next. Even after a break in the routine due to rain, the hidden state ensures that the RNN can resume the activity sequence without starting from scratch.

The hidden state's ability to capture temporal dependencies is what gives RNNs their power in tasks like time series prediction, language modeling, and any other domain where the sequence and context matter. It is what makes the RNN 'recurrent', giving it the capability to perform complex tasks that require an understanding of the sequence history.

Unrolled Recurrent Neural Networks

The loops in recurrent neural networks might initially seem complex, but upon further reflection, they are quite similar to a standard neural network. Essentially, an RNN can be viewed as numerous copies of the same network, where each copy passes information to the next one. This process is easier to comprehend when we unroll the loop, showcasing the sequential flow of the network.

Model parameters in RNNs

Model parameters in RNNs determine how the network processes and learns from sequential data. These parameters come in the form of weight matrices and are central to the RNN's ability to perform tasks that involve making predictions over time.

In an RNN, each hidden state $h_t$ is calculated using a recurrence formula that takes into account the previous hidden state $h_{t-1}$ and the current input $x_t$. This can be represented as:

$$h_t = f_W(h_{t-1}, x_t)$$

Here, $f_W$ is a function parameterized by the weight matrix $W$. The function combines the information from the past (the old state $h_{t-1}$) and the present (the input vector $x_t$) to generate the new state $h_t$.

The RNN uses the following equations to update the hidden state and to produce the output:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$

$$y_t = W_{hy} h_t$$

Let's look at this classic formulation of the RNN, where the $\tanh$ activation function is used because it helps maintain the numerical stability of the gradients throughout the learning process. The function is applied to a linear combination of the previous hidden state and the current input, each multiplied by its respective weight matrix, $W_{hh}$ and $W_{xh}$. The result is the new hidden state $h_t$, which captures the system's current memory, encapsulating information not just from the immediate past $h_{t-1}$ but from all previous steps. The matrix $W_{hy}$ contains the weights that transform the hidden state $h_t$ into the output. This memory is used at each time step to produce the output $y_t$, which, in the context of sequence prediction, could be the next word in a sentence or the next note in a piece of music. The output is often passed through a softmax function to represent a probability distribution over possible next elements in the sequence.
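To make these update rules concrete, here is a minimal NumPy sketch of a single recurrence step. The sizes, the random initialization, and the name rnn_step are illustrative assumptions for this example, not part of any particular library.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch).
input_size, hidden_size, output_size = 4, 3, 2

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden-to-output weights

def rnn_step(h_prev, x_t):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)             # initial hidden state
x = rng.standard_normal(input_size)   # one input vector x_t
h, y = rnn_step(h, x)
print(h.shape, y.shape)               # (3,) (2,)
```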

These three weight matrices ($W_{hh}$, $W_{xh}$, and $W_{hy}$) are shared across all time steps.

Weight sharing analogy

Weight sharing is a technique where the same weights are used at each step of processing the sequence, as depicted by the repeated 'Box 1'. Instead of learning separate parameters for each time step, the network reuses the same parameters, which keeps the model's complexity fixed: the number of parameters does not grow with the length of the sequence. This sharing is also what allows RNNs to handle input sequences of various lengths, sometimes referred to as the "temporal size" of the sequences.

RNN weight sharing

Through these parameters and their recurrent application over time, RNNs can maintain a dynamic and evolving representation of the sequential information, allowing them to make informed predictions and decisions based on both recent and long-past inputs.
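To see weight sharing in action, the same three matrices can be applied in a loop over an entire toy sequence, which is exactly the unrolled view described earlier. This is a self-contained sketch; the sequence length and dimensions are again illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the unrolled loop: one set of weights, reused at every time step.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

sequence = rng.standard_normal((5, input_size))  # a toy sequence of 5 input vectors
h = np.zeros(hidden_size)                        # initial hidden state
outputs = []
for x_t in sequence:                             # the unrolled loop over time
    h = np.tanh(W_hh @ h + W_xh @ x_t)           # same W_hh and W_xh at every step
    outputs.append(W_hy @ h)                     # same W_hy at every step

print(len(outputs))  # 5 outputs produced from a single, shared set of weights
```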

RNN architectures

RNN architectures

RNNs come in various architectures, each suited to a specific type of task based on the nature of the input and output sequences. The architectures can be broadly categorized into four types: one-to-one, one-to-many, many-to-one, and many-to-many.

One-to-One: This architecture represents the standard neural network model, where there is one input and one output. It's commonly used in tasks that involve classification, where a fixed-sized input, such as an image or a single data point, is mapped to a category or label.

One-to-Many: This type of architecture is useful for applications like image captioning, where a single input, such as an image, leads to a sequence of outputs, which in this case would be a series of words forming the caption. Here, the RNN needs to generate an output sequence of variable length from a single, fixed-size input.

Many-to-One: The many-to-one architecture is typically used for sentiment analysis or any task that requires understanding the whole sequence to produce a single result. In sentiment analysis, for example, a series of words (the sequence of inputs) expresses a sentiment that is ultimately classified as positive, negative, or neutral (the single output). A minimal code sketch of this setup follows the list.

Many-to-Many: There are two types of many-to-many RNN architectures. The first kind is used for tasks such as sequence tagging or labeling, where each input in a sequence corresponds to an output, often seen in part-of-speech tagging in natural language processing. The second kind of many-to-many architecture is seen in sequence-to-sequence tasks, such as machine translation, where an input sequence (a sentence in the source language) is converted into an output sequence (the translated sentence in the target language). This architecture processes the input sequence to capture the context and then generates the output sequence based on that context.

Each architecture integrates the strength of RNNs to handle sequential data, with the flexibility to accommodate various lengths and complexities of sequences.
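As one concrete illustration of the many-to-one case, here is a hedged sketch using PyTorch's built-in nn.RNN layer: it reads a whole sequence and maps the final hidden state to a handful of classes. The layer sizes, batch shape, and class count are assumptions made for the example, not values from the text.

```python
import torch
import torch.nn as nn

# A minimal many-to-one sketch: read a whole sequence, emit one label.
# Sizes (input_size=16, hidden_size=32, 3 sentiment classes) are illustrative assumptions.
class ManyToOneRNN(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, num_classes=3):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                 # x: (batch, seq_len, input_size)
        _, h_n = self.rnn(x)              # h_n: (1, batch, hidden_size), final hidden state
        return self.classifier(h_n[-1])   # one output per sequence

x = torch.randn(8, 20, 16)                # batch of 8 sequences, 20 steps each
logits = ManyToOneRNN()(x)
print(logits.shape)                       # torch.Size([8, 3])
```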

Vanishing gradient

How was your day?

The vanishing gradient problem is a challenge in training RNNs effectively, especially when dealing with long sequences. Let's explore how this issue arises through the process of encoding a sequence in an RNN.

Consider the sequence of words "How was your day?". The RNN starts by processing the first word, "How". It encodes this word into a hidden state and outputs a result. In the next step, the word "was" is fed into the RNN along with the previous hidden state. Now the RNN holds information about both "How" and "was". This sequential processing continues until the last word is fed into the RNN.

By the time we reach the end of the sequence, the RNN has theoretically encoded information from all previous words. The final hidden state should contain the entire sequence's context, which can be used to predict an intent or classify a sentiment.

Final Hidden State

However, a look at the final hidden state's representation, as depicted in the accompanying illustration, reveals a peculiar color distribution. This is meant to symbolize the short-term memory issue caused by vanishing gradients. As the RNN progresses through the sequence, it struggles to retain information from the initial steps. The influence of the words "How" and "was" appears to diminish, becoming almost negligible by the final step.

This short-term memory problem stems from the nature of backpropagation. Backpropagation calculates the gradients for each node based on the error value and adjusts the weights accordingly. The gradient signifies how much the weights should change to reduce the error. However, during backpropagation, each layer's gradient is calculated from the gradient of the layer that comes after it in the forward pass, which is processed just before it during the backward pass. If that gradient is small, the current layer's gradient becomes even smaller. As this process repeats layer by layer in reverse, gradients can diminish to the point of being almost non-existent in the early layers.

Vanishing gradient

As a result, the early layers of the network barely learn, with their weights remaining almost unchanged. This is problematic for RNNs as they are meant to learn from earlier steps in the sequence to predict later ones. The vanishing gradients make the RNN forget the earlier parts of the sequence, like "How" and "was", and force it to rely on more recent information like "your day?", which is vague without the full context.

This limitation hinders the RNN's ability to learn long-range dependencies, meaning it cannot effectively use early sequence information to make predictions or classifications. This problem leads to what is described as the RNN having a short-term memory, significantly impacting its performance on tasks that require understanding over longer sequences.
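A rough numerical sketch can show why this happens: backpropagating through time multiplies one Jacobian factor, $\text{diag}(1 - h_t^2)\,W_{hh}$, per time step of the tanh update above, and when these factors are small their product shrinks rapidly. The weight scale and sequence length below are arbitrary assumptions chosen to make the decay easy to see.

```python
import numpy as np

# Toy backpropagation through time for the tanh RNN update.
# The small weight scale (0.1) is an illustrative assumption, chosen so the decay is visible.
rng = np.random.default_rng(0)
hidden_size = 8
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)                                     # d h_0 / d h_0
for t in range(1, 21):
    h = np.tanh(W_hh @ h + rng.standard_normal(hidden_size))   # forward step with random input
    grad = np.diag(1 - h**2) @ W_hh @ grad                     # accumulate d h_t / d h_0
    if t % 5 == 0:
        print(f"after {t:2d} steps, ||d h_t / d h_0|| = {np.linalg.norm(grad):.2e}")
# The norm shrinks toward zero: the influence of the earliest inputs all but vanishes.
```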

To mitigate the vanishing gradient problem, several techniques are employed. One common method is to use gated units, such as Long Short-Term Memory (LSTM) units or Gated Recurrent Units (GRUs), which are designed to have mechanisms to remember and forget information selectively, thus preserving gradients over longer sequences.
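In practice, these gated units are available off the shelf; in PyTorch, for instance, the plain recurrent layer can be swapped for nn.LSTM or nn.GRU, which expose a similar interface. The shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 16)                     # (batch, seq_len, input_size), illustrative sizes

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
out, (h_n, c_n) = lstm(x)                      # LSTM keeps a cell state c_n alongside h_n

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
out, h_n = gru(x)                              # GRU uses gates but no separate cell state

print(out.shape)                               # torch.Size([8, 20, 32]) in both cases
```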

Conclusion

To summarize, Recurrent Neural Networks (RNNs) are well-suited for processing sequential data due to their hidden states that transfer information through the sequence. Shared weight matrices in RNNs allow for efficient parameter management. Yet, RNNs encounter the vanishing gradient problem, impacting their capacity to learn from longer sequences and leading to a phenomenon akin to short-term memory. Despite these challenges, RNNs remain useful for various applications, including language processing and time series analysis.
