For decades, neural networks have proven to be extraordinary pattern-recognition machines. They can classify images, translate languages, and even generate creative text. However, they’ve historically struggled with tasks that a first-year computer science student would find trivial—like copying a sequence of data, sorting a list, or performing associative recall.
Why? Because traditional neural networks, even powerful ones like LSTMs, lack a fundamental component of classical computers: an external, addressable memory. They have to cram all their knowledge into the weights of their neurons, which is like trying to do complex calculations using only a mental scratchpad.
What if we could bridge this gap? What if we could give a neural network access to a dedicated memory bank—much like a computer’s RAM—and let it learn how to read from and write to it?
This is the core idea behind the groundbreaking 2014 paper from Google DeepMind, Neural Turing Machines. The researchers introduced a novel architecture that couples a neural network with an external memory, creating a system that can learn simple algorithms purely from input-output examples.
The most brilliant part? The entire system is differentiable end-to-end. This means we can train it with our trusty tool—gradient descent—just like any other deep learning model. The network doesn’t just learn a task; it learns the procedure for solving that task.
In this article, we’ll explore how Neural Turing Machines (NTMs) work, the ingenious mechanisms that make them possible, and how they learn to perform tasks that were previously out of reach for neural networks.
The NTM Architecture: A CPU with Learnable RAM
At its heart, a Neural Turing Machine is composed of two main parts: a controller and a memory matrix.
You can think of the controller as the CPU. It’s a standard neural network—either a feedforward network or a recurrent network like an LSTM. It processes inputs, generates outputs, and decides how to interact with the memory.
The memory matrix acts like RAM. It’s a large, two-dimensional array of real numbers. The controller can only interact with this memory via read and write heads, analogous to the read/write heads in a classical Turing Machine.
The main challenge is making the interaction with memory differentiable. In a conventional computer, you read from a discrete address (0x1A) or you don’t—that’s not compatible with gradient descent. NTMs solve this with attentional, “blurry” focus: instead of addressing one memory location, heads assign weights to all locations, determining how strongly each is read or written. This “soft” approach makes the whole process differentiable.
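To see the difference concretely, here’s a toy NumPy comparison of hard and soft addressing; the memory contents and sizes are made up purely for illustration:

```python
import numpy as np

# A tiny 4-location memory with 2 values per location (illustrative numbers).
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0],
                   [3.0, 1.0]])

# Hard addressing: pick exactly one row by integer index -- a discrete choice
# that gradient descent cannot differentiate through.
hard_read = memory[2]

# Soft addressing: a normalized weighting over ALL rows; the read is a weighted
# average, so gradients flow back into the weights themselves.
w = np.array([0.05, 0.05, 0.80, 0.10])
soft_read = w @ memory          # close to memory[2] because w is nearly one-hot

print(hard_read)                # [2. 2.]
print(soft_read)                # [1.95 1.75]
```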
Reading from Memory
Reading doesn’t select one location; it computes a weighted average of all memory rows.
If \( M_t \) is the \(N \times M\) memory matrix at time \(t\) (with \(N\) rows and vector length \(M\)), the read head emits a normalized weighting vector \( w_t \):
\[ \sum_{i=0}^{N-1} w_t(i) = 1, \quad 0 \le w_t(i) \le 1 \]

The returned read vector is:
\[ \mathbf{r}_t = \sum_{i=0}^{N-1} w_t(i) \mathbf{M}_t(i) \]

If \( w_t \) is sharp (one element near 1, the rest near 0), the read resembles a discrete lookup. If blurry, it blends multiple locations.
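In code, the read is just a matrix product. Here’s a minimal NumPy sketch; the function name and array shapes are our own, not from the paper:

```python
import numpy as np

def read_head(memory: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Return r_t = sum_i w_t(i) * M_t(i).

    memory : (N, M) memory matrix M_t
    w      : (N,) read weighting, non-negative and summing to 1
    """
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "w must be a valid weighting"
    return w @ memory   # convex combination of the memory rows
```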
Writing to Memory
Writing is a two-phase process: erase then add, inspired by LSTM gating.
Given a write weighting \( w_t \), erase vector \( \mathbf{e}_t \) (\(M\) values between 0 and 1), and add vector \( \mathbf{a}_t \):
Erase:
\[ \tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i) \left[ \mathbf{1} - w_t(i) \mathbf{e}_t \right] \]

Only components with both a high weighting and a high erase value are driven toward zero.
Add:
\[ \mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i) \mathbf{a}_t \]
This fine-grained, continuous control allows selective updates while remaining differentiable.
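Here’s a matching NumPy sketch of the erase-then-add update (names and shapes are ours again):

```python
import numpy as np

def write_head(memory: np.ndarray, w: np.ndarray,
               erase: np.ndarray, add: np.ndarray) -> np.ndarray:
    """Apply the erase-then-add write and return the new memory matrix.

    memory : (N, M) memory matrix M_{t-1}
    w      : (N,) write weighting
    erase  : (M,) erase vector e_t, components in [0, 1]
    add    : (M,) add vector a_t
    """
    # Erase: row i is scaled elementwise by (1 - w(i) * e_t).
    erased = memory * (1.0 - np.outer(w, erase))
    # Add: row i receives w(i) * a_t.
    return erased + np.outer(w, add)
```

With a sharp weighting and an all-ones erase vector, this overwrites a single row; with a near-zero weighting, the row is left essentially untouched.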
Differentiable Addressing: The Secret Weapon
The big question: how does the controller produce the weighting vector \( w_t \)?
The answer: by combining content-based addressing and location-based addressing.
1. Focusing by Content
Content-based addressing searches memory by similarity. The controller emits a key vector \( \mathbf{k}_t \) and a key strength \( \beta_t \).
Similarity is measured (commonly via cosine similarity):
\[ K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \, ||\mathbf{v}||} \]

Weights are computed as:
\[ w_t^{c}(i) = \frac{\exp\left(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(i)]\right)}{\sum_j \exp\left(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(j)]\right)} \]
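In NumPy, content addressing is cosine similarity followed by a \( \beta_t \)-scaled softmax. A minimal sketch (the epsilon guard and max-subtraction are our additions for numerical safety):

```python
import numpy as np

def content_addressing(memory: np.ndarray, key: np.ndarray, beta: float) -> np.ndarray:
    """Weighting w_t^c: softmax over key/row cosine similarities, scaled by beta.

    memory : (N, M) memory matrix
    key    : (M,) key vector k_t emitted by the controller
    beta   : key strength beta_t >= 0; larger beta means sharper focus
    """
    eps = 1e-8  # avoid division by zero for all-zero rows
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sim - (beta * sim).max()   # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```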
2. Focusing by Location

Sometimes sequential traversal is needed—location-based addressing enables iteration.
First, interpolate between the new content address \( w_t^c \) and the previous weighting \( w_{t-1} \) via a gate \( g_t \):
\[ \mathbf{w}_t^{g} = g_t \mathbf{w}_t^{c} + (1 - g_t) \mathbf{w}_{t-1} \]

Then apply a rotational shift, via a shift weighting \( s_t \) over allowed integer shifts:
\[ \tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^{g}(j) \, s_t(i-j) \quad \text{(mod $N$)} \]

Finally, sharpen with factor \( \gamma_t \ge 1 \):
\[ w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \]

This triple mechanism (sketched in code after the list below) can:
- Jump to a content match.
- Shift to a nearby location.
- Iterate without content lookup.
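Here’s a sketch of the location-based pipeline (interpolation, circular shift, sharpening); the convention that shift[j] holds the weight for a rotation of +j locations, with negative shifts wrapping around, is our own assumption:

```python
import numpy as np

def location_addressing(w_content: np.ndarray, w_prev: np.ndarray,
                        g: float, shift: np.ndarray, gamma: float) -> np.ndarray:
    """Turn a content weighting into the final head weighting w_t.

    w_content : (N,) content-based weighting w_t^c
    w_prev    : (N,) weighting from the previous step w_{t-1}
    g         : interpolation gate g_t in [0, 1]
    shift     : (N,) shift weighting s_t; shift[j] is the weight given to a
                rotation by +j locations (negative shifts wrap to the end)
    gamma     : sharpening exponent gamma_t >= 1
    """
    n = len(w_content)
    # 1. Interpolate between the content address and the previous address.
    w_g = g * w_content + (1.0 - g) * w_prev
    # 2. Circular convolution: w_tilde(i) = sum_j w_g(j) * s_t((i - j) mod N).
    w_tilde = np.array([sum(w_g[j] * shift[(i - j) % n] for j in range(n))
                        for i in range(n)])
    # 3. Sharpen to undo the blur the convolution introduces.
    w_sharp = w_tilde ** gamma
    return w_sharp / w_sharp.sum()

# Iterating without a content lookup: with g = 0 the content weighting is
# ignored, and a shift weighting concentrated on +1 simply moves the focus
# one location forward.
N = 5
w_prev = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
s = np.zeros(N); s[1] = 1.0
print(location_addressing(np.ones(N) / N, w_prev, g=0.0, shift=s, gamma=1.0))
# -> [0. 0. 0. 1. 0.]
```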
How Well Does It Work?
The Copy Task
The “Hello, World” of memory-augmented neural networks: show the network a sequence, then have it reproduce the sequence after a delay.
Trained on sequences up to length 20, NTMs generalized to length 30, 50—even 120.
Inspecting the memory accesses reveals the learned algorithm (sketched in code after the list):
- Write loop: start location → write input → shift head forward.
- Read loop: reset to start → read output → shift forward.
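In code, the two loops look roughly like this. To be clear, this is our reconstruction of the behaviour the paper infers from its memory-access plots, not the paper’s own code, and it uses perfectly sharp weightings for readability:

```python
import numpy as np

def copy_like_an_ntm(sequence, n_locations=32):
    """Write each vector to consecutive rows, then re-read them from the start."""
    m = len(sequence[0])
    memory = np.zeros((n_locations, m))

    # Write loop: start location -> write input -> shift head forward.
    w = np.zeros(n_locations); w[0] = 1.0
    for x in sequence:
        memory = memory * (1.0 - np.outer(w, np.ones(m))) + np.outer(w, x)  # erase + add
        w = np.roll(w, 1)                                                   # +1 shift

    # Read loop: reset to start -> read output -> shift forward.
    w = np.zeros(n_locations); w[0] = 1.0
    outputs = []
    for _ in sequence:
        outputs.append(w @ memory)
        w = np.roll(w, 1)
    return outputs

print(copy_like_an_ntm([np.array([1.0, 0.0]), np.array([0.0, 1.0])]))
# -> [array([1., 0.]), array([0., 1.])]
```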
Repeat Copy
Input: sequence + number of repetitions → output: that sequence repeated.
NTM generalizes to longer sequences and more repeats than seen in training.
Memory usage shows a nested loop: one write pass, multiple read passes, with head jumps back to start—effectively a goto.
Associative Recall
Input: a sequence of items; query: one of those items; output: the item that followed it.
Learned strategy: compress each item into a single vector and write it to memory; on a query, recompute the query’s compressed vector, locate it by content addressing, then shift +1 to read the item that followed.
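Here’s a toy sketch of that strategy (our reconstruction, assuming each item has already been compressed to a single vector and using a hand-picked key strength):

```python
import numpy as np

def recall_next(items, query, beta=50.0):
    """Content-match the query, shift the focus by +1, and read the next item."""
    memory = np.stack(items)                          # one compressed item per row
    sim = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query))
    w = np.exp(beta * sim); w /= w.sum()              # content-based focus on the query
    w_next = np.roll(w, 1)                            # location shift of +1
    return w_next @ memory                            # ~ the item stored after the query

items = [np.array([1.0, 0.0, 0.0]),
         np.array([0.0, 1.0, 0.0]),
         np.array([0.0, 0.0, 1.0])]
print(recall_next(items, query=np.array([0.0, 1.0, 0.0])))   # ~ [0, 0, 1]
```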
Priority Sort
Input: vectors with scalar priorities; output: top priorities in sorted order.
NTM mapped priorities to write addresses (high priority → higher index). Sorting became sequential reading from highest to lowest address.
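That hypothesized strategy can be sketched as follows; this is our reconstruction, with an illustrative priority-to-address mapping and hard indexing standing in for the NTM’s sharp write weightings:

```python
import numpy as np

def priority_sort(vectors, priorities, n_locations=16):
    """Write each vector at an address derived from its priority, then read sequentially."""
    memory = np.zeros((n_locations, len(vectors[0])))
    written = np.zeros(n_locations, dtype=bool)
    for v, p in zip(vectors, priorities):
        # Priority in [0, 1] maps linearly to an address; higher priority -> higher index.
        i = int(round(p * (n_locations - 1)))
        memory[i] = v
        written[i] = True
    # Reading addresses from highest to lowest yields the vectors in priority order.
    return [memory[i] for i in range(n_locations - 1, -1, -1) if written[i]]

vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(priority_sort(vectors, priorities=[0.2, 0.9, 0.5]))
# -> [array([0., 1.]), array([1., 1.]), array([1., 0.])]
```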
Key Takeaways
The Neural Turing Machine was a landmark in neural network research:
- Differentiable Memory: Soft attention for reading/writing enables end-to-end gradient training.
- Hybrid Addressing: Combining content and location addressing supports diverse data structures and algorithms.
- Algorithmic Generalization: NTMs apply learned procedures to inputs far longer than any seen during training.
The NTM opened the door to Memory-Augmented Neural Networks, inspiring successors like the Differentiable Neural Computer (DNC). It bridged the gap between symbolic, rule-based AI and sub-symbolic, learned representations—showing that, with the right architecture, a neural network can learn to think more like a computer.