Recurrent Neural Networks (RNNs) are the workhorses of sequence modeling, powering everything from machine translation to text generation. A common strategy to boost their capability is to make them deep by stacking multiple recurrent layers. This stacked approach is intuitive: lower layers handle low-level, fast-changing features, while higher layers learn more abstract, slow-moving concepts. In conventional designs, information flows upward through the stack.
But what if this one-way street is too restrictive? What if the high-level understanding from an upper layer could provide crucial context to a lower layer?
Imagine a network writing a story: a high-level layer might track the overall plot point (the character is in danger), while a low-level layer generates the actual words. Knowing the plot point would be incredibly helpful for choosing the right vocabulary (“frantically,” “desperately”) at the character level.
This is the core idea behind the 2015 paper Gated-Feedback Recurrent Neural Networks. The researchers propose a new architecture—the Gated-Feedback RNN (GF-RNN)—that breaks the rigid upward-only flow of information in stacked RNNs. By adding connections from upper layers back down to lower ones—and, crucially, making these connections learnable gates—they created a more dynamic and powerful model that can adapt its internal information flow on the fly.
In this post, we’ll explore how GF-RNNs work, why they’re effective, and how they outperformed conventional models on challenging tasks like character-level language modeling and even evaluating Python code.
A Quick Refresher on RNNs
Before diving into GF-RNNs, let’s review the basics of RNNs and their advanced variants, LSTMs and GRUs.
An RNN processes a sequence one step at a time. At each timestep \(t\), it takes an input vector \(\mathbf{x}_t\) and the hidden state from the previous step, \(\mathbf{h}_{t-1}\), to compute a new hidden state \(\mathbf{h}_t\):
\[ \mathbf{h}_{t} = f\left(\mathbf{x}_{t}, \mathbf{h}_{t-1}\right) \]
Typically, \(f\) is an affine transformation of both \(\mathbf{x}_t\) and \(\mathbf{h}_{t-1}\) followed by a non-linear activation such as tanh:
\[ \mathbf{h}_{t} = \tanh\left(W\mathbf{x}_{t} + U\mathbf{h}_{t-1}\right) \]
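To make this concrete, here is a minimal NumPy sketch of a single step of a vanilla tanh-RNN; the parameter names (W_xh, W_hh, b_h) and sizes are illustrative, not taken from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: affine transform of input and previous state, then tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative sizes: 10-dimensional inputs, 20-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(20, 10))
W_hh = rng.normal(scale=0.1, size=(20, 20))
b_h = np.zeros(20)

h = np.zeros(20)                    # initial hidden state
for x in rng.normal(size=(5, 10)):  # a toy sequence of 5 input vectors
    h = rnn_step(x, h, W_xh, W_hh, b_h)
```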
This recurrent loop lets the network maintain a “memory” of past information. However, standard RNNs struggle to learn long-term dependencies due to the vanishing gradient problem, where influences from distant past steps fade and become hard to capture.
Gated RNNs: LSTM and GRU
To address this, researchers developed architectures with gating mechanisms—notably Long Short-Term Memory (LSTM) networks and the Gated Recurrent Unit (GRU). These gates allow the network to explicitly control what to remember, forget, and output.
LSTMs feature:
- Forget Gate (\(f_t\)): Decides what information to discard from the cell state.
- Input Gate (\(i_t\)): Determines which new information to store.
- Output Gate (\(o_t\)): Controls what to reveal from the cell state as output.
The memory cell update balances old and new information:
\[ c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j \]
GRUs simplify this, using only two gates:
- Reset Gate (\(r_t\)): Controls how much of the past to forget when creating a new candidate state.
- Update Gate (\(z_t\)): Blends the old hidden state with the new candidate state.
GRU update rule:
\[ h_t^j = (1 - z_t^j)h_{t-1}^j + z_t^j \tilde{h}_t^j \]
Both LSTMs and GRUs handle long-range dependencies more effectively than vanilla RNNs and form the building blocks of the GF-RNN.
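As a minimal sketch of how these gates fit together, here is one GRU step in NumPy following the update rule above; all parameter names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    """One GRU step: reset gate r, update gate z, candidate state h_tilde, blended output."""
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde            # h_t: blend of old state and candidate

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden state.
rng = np.random.default_rng(1)
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(16, 8)) for _ in range(3))
U_r, U_z, U_h = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(3))
h = gru_step(rng.normal(size=8), np.zeros(16), W_r, U_r, W_z, U_z, W_h, U_h)
```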
The Core Idea: Gated Feedback Connections
In conventional deep RNNs, stacked layers create a hierarchy: lower layers feed into higher layers at the same timestep. Information flows upward only.
The GF-RNN architecture changes this. As shown below, it adds connections from all layers at the previous timestep (\(t-1\)) to all layers at the current timestep (\(t\)), enabling feedback connections from upper layers to lower layers.
Figure 1. A visual comparison of a conventional stacked RNN (a) and the proposed Gated-Feedback RNN (b). Bullets indicate global reset gates controlling feedback connectivity.
Adaptive Feedback with Global Reset Gates
Simply adding these dense connections would risk chaotic or redundant information flow. The GF-RNN solves this with global reset gates, each controlling one connection between a pair of layers across timesteps.
A gate \(g^{i \to j}\) is a scalar between 0 and 1:
- 0: Shut off the connection entirely.
- 1: Fully open the connection.
Each gate is computed dynamically at every timestep:
\[ g^{i \to j} = \sigma \left( \mathbf{w}_g^{i \to j} \mathbf{h}_t^{j-1} + \mathbf{u}_g^{i \to j} \mathbf{h}_{t-1}^* \right) \]
where \(\mathbf{h}_{t-1}^*\) is the concatenation of all hidden states from timestep \(t-1\), and \(\mathbf{h}_t^{j-1}\) is the input to layer \(j\) at timestep \(t\) (for the first layer, this is \(\mathbf{x}_t\)).
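A minimal sketch of this gate computation, assuming the previous hidden states of all layers have been concatenated into a single vector; the names (h_star_prev, w_g, u_g) are illustrative, not from the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def global_reset_gate(h_input, h_star_prev, w_g, u_g):
    """Scalar gate g^{i->j} in (0, 1) for the feedback connection from layer i to layer j.

    h_input     -- input to layer j at time t (h_t^{j-1}; x_t for the first layer)
    h_star_prev -- concatenation of all layers' hidden states at time t-1
    w_g, u_g    -- gate parameter vectors specific to the (i, j) pair
    """
    return sigmoid(w_g @ h_input + u_g @ h_star_prev)

# Illustrative sizes: 3 layers of 16 units, an 8-dimensional input to layer 1.
rng = np.random.default_rng(2)
x_t = rng.normal(size=8)
h_star_prev = rng.normal(size=3 * 16)
w_g = rng.normal(scale=0.1, size=8)       # acts on the input to layer 1
u_g = rng.normal(scale=0.1, size=3 * 16)  # acts on all previous hidden states
g_3_to_1 = global_reset_gate(x_t, h_star_prev, w_g, u_g)  # e.g., feedback from layer 3 into layer 1
```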
Integration with Different RNN Units
For tanh-RNNs: The gates directly modulate how much each layer's previous hidden state contributes to the current hidden state:
\[ \mathbf{h}_{t}^{j} = \tanh\left(W^{j-1 \to j}\mathbf{h}_{t}^{j-1} + \sum_{i=1}^{L} g^{i \to j} U^{i \to j}\mathbf{h}_{t-1}^{i}\right) \]
For LSTMs/GRUs: Internal unit gates (forget, input, reset, etc.) remain unmodified; the global reset gates only influence the computation of the candidate cell memory or candidate hidden state.
LSTM candidate memory:
\[ \tilde{\mathbf{c}}_t^j = \tanh\left(W_c^{j-1\to j}\mathbf{h}_t^{j-1} + \sum_{i=1}^L g^{i\to j}U_c^{i\to j}\mathbf{h}_{t-1}^i\right) \]
GRU candidate state:
\[ \tilde{\mathbf{h}}_t^j = \tanh\left(W^{j-1\to j}\mathbf{h}_t^{j-1} + \mathbf{r}_t^j \odot \sum_{i=1}^L g^{i\to j} U^{i\to j}\mathbf{h}_{t-1}^i\right) \]
This design lets the GF-RNN learn when, and how strongly, high-level context should influence lower-level processing across varying timescales.
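Putting the pieces together, here is a sketch of the gated-feedback candidate state for a single GF-GRU layer; the function and variable names (gf_gru_candidate, U_fb, and so on) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gf_gru_candidate(h_input, h_prev_all, r_j, W_in, U_fb, gates):
    """Candidate state for GRU layer j in a GF-RNN.

    h_input    -- input to layer j at time t (output of layer j-1, or x_t for layer 1)
    h_prev_all -- hidden states of all L layers at time t-1
    r_j        -- the layer's own unit-wise reset gate (standard GRU gate)
    W_in       -- input weight matrix W^{j-1 -> j}
    U_fb       -- recurrent matrices U^{i -> j}, one per source layer i
    gates      -- scalar global reset gates g^{i -> j}, one per source layer i
    """
    feedback = sum(g * (U @ h_prev)
                   for g, U, h_prev in zip(gates, U_fb, h_prev_all))
    return np.tanh(W_in @ h_input + r_j * feedback)

# Illustrative usage: 3 layers of 16 units, an 8-dimensional input to layer 1.
rng = np.random.default_rng(3)
h_prev_all = [rng.normal(size=16) for _ in range(3)]
U_fb = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(3)]
W_in = rng.normal(scale=0.1, size=(16, 8))
r_j = np.ones(16)         # reset gate fully open in this toy example
gates = [0.9, 0.5, 0.1]   # in practice computed by the global reset gates shown earlier
h_tilde = gf_gru_candidate(rng.normal(size=8), h_prev_all, r_j, W_in, U_fb, gates)
```

The usual GRU interpolation \(h_t^j = (1 - z_t^j)h_{t-1}^j + z_t^j \tilde{h}_t^j\) then proceeds unchanged. Note that fixing every global gate to 1 collapses this into a densely connected stacked RNN without adaptive feedback, which is exactly the ablation examined in the experiments below.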
Experiments
The authors tested GF-RNNs on two challenging sequence tasks:
- Character-Level Language Modeling – Predicting the next character in a large text corpus (Hutter Prize Wikipedia dataset), measured in Bits-Per-Character (BPC), the average negative log-likelihood per character in base 2 (lower is better).
- Python Program Evaluation – Feeding Python code as character sequences and predicting the output of its print statement, testing logical and structural reasoning (see the illustrative example after this list).
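As a rough illustration of the program-evaluation task format (this snippet is made up for this post, not drawn from the benchmark data), the model reads a short script character by character and must emit the characters of its printed output:

```python
# Illustrative program in the style of the task (not an actual example from the dataset).
# The model receives these characters one at a time as input...
j = 8584
for x in range(8):
    j += 920
print(j)
# ...and must produce the characters of the printed result, here "15944".
```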
Language Modeling Results
GF-RNNs with GRU and LSTM units outperformed their single-layer and stacked counterparts, even with similar parameter counts.
Table 2. GF-RNNs with GRU/LSTM units achieve the lowest BPC. The tanh-RNN did not benefit from the architecture.
GF-LSTM achieved 1.842 BPC vs. 1.868 for stacked LSTM. A larger GF-LSTM (same units per layer as stacked) reached 1.789 BPC.
Training efficiency was also improved:
Figure 2. GF-RNNs learn faster and achieve lower BPC than stacked RNNs.
Effect of Gating
Fixing all global reset gates to 1 (always open) degraded the GF-LSTM to 1.854 BPC: still better than the stacked LSTM's 1.868, but worse than the fully gated GF-LSTM's 1.842, showing that adaptive gating is essential.
Qualitative Analysis: Text Generation
Seeded generation with Wikipedia XML markup showed the GF-LSTM successfully closing XML tags (</username>, </contributor>), whereas the stacked LSTM often failed.
Table 3. GF-LSTM better captures markup structure thanks to top-down context feedback.
A large 5-layer GF-LSTM achieved 1.58 BPC, a state-of-the-art result at the time:
Table 4. Large GF-LSTM outperforms previous best results.
Python Program Evaluation Results
Difficulty was controlled via nesting depth and target output length. GF-RNN consistently beat stacked RNNs across complexity levels.
Figure 3. GF-RNNs show the largest gains on the most complex programs.
Gap plots (c, f) show that the largest improvements occur at deep nesting levels and long target outputs, exactly where long-range reasoning is most critical.
Conclusion and Implications
The GF-RNN is a powerful refinement of deep RNN architecture:
- Top-Down Feedback is Powerful – Higher layers can guide lower layers with crucial context, improving complex sequence modeling.
- Adaptive Gating is Key – Global reset gates enable the network to learn optimal information flow dynamically.
- Improved Efficiency – GF-RNNs train faster and achieve better results with equal capacity.
- Scaling Benefits – Gains are greatest on complex tasks with long-term dependencies.
This architecture challenges the rigid hierarchy of stacked RNNs by letting networks learn their internal connectivity patterns. The outcome: models that adapt structure on the fly to fit the data, demonstrating that flexible communication channels make neural networks far more capable for difficult, long-range reasoning tasks.