For decades, neural networks have proven to be extraordinary pattern-recognition machines. They can classify images, translate languages, and even generate creative text. However, they’ve historically struggled with tasks that a first-year computer science student would find trivial—like copying a sequence of data, sorting a list, or performing associative recall.

Why? Because traditional neural networks, even powerful ones like LSTMs, lack a fundamental component of classical computers: an external, addressable memory. They have to cram all their knowledge into the weights of their neurons, which is like trying to do complex calculations using only a mental scratchpad.

What if we could bridge this gap? What if we could give a neural network access to a dedicated memory bank—much like a computer’s RAM—and let it learn how to read from and write to it?

This is the core idea behind the groundbreaking 2014 paper from Google DeepMind, Neural Turing Machines. The researchers introduced a novel architecture that couples a neural network with an external memory, creating a system that can learn simple algorithms purely from input-output examples.

The most brilliant part? The entire system is differentiable end-to-end. This means we can train it with our trusty tool—gradient descent—just like any other deep learning model. The network doesn’t just learn a task; it learns the procedure for solving that task.

In this article, we’ll explore how Neural Turing Machines (NTMs) work, the ingenious mechanisms that make them possible, and how they learn to perform tasks that were previously out of reach for neural networks.


The NTM Architecture: A CPU with Learnable RAM

At its heart, a Neural Turing Machine is composed of two main parts: a controller and a memory matrix.

A high-level diagram of the Neural Turing Machine architecture, showing a central controller interacting with an external memory matrix via read and write heads.

You can think of the controller as the CPU. It’s a standard neural network—either a feedforward network or a recurrent network like an LSTM. It processes inputs, generates outputs, and decides how to interact with the memory.

The memory matrix acts like RAM. It’s a large, two-dimensional array of real numbers. The controller can only interact with this memory via read and write heads, analogous to the read/write heads in a classical Turing Machine.

The main challenge is making the interaction with memory differentiable. In a conventional computer, you read from a discrete address (0x1A) or you don’t—that’s not compatible with gradient descent. NTMs solve this with attentional, “blurry” focus: instead of addressing one memory location, heads assign weights to all locations, determining how strongly each is read or written. This “soft” approach makes the whole process differentiable.


Reading from Memory

Reading doesn’t select one location; it computes a weighted average of all memory rows.

If \( M_t \) is the \(N \times M\) memory matrix at time \(t\) (with \(N\) rows and vector length \(M\)), the read head emits a normalized weighting vector \( w_t \):

\[ \sum_{i=0}^{N-1} w_t(i) = 1, \quad 0 \le w_t(i) \le 1 \]

The returned read vector is:

\[ \mathbf{r}_t = \sum_{i=0}^{N-1} w_t(i) \mathbf{M}_t(i) \]

If \( w_t \) is sharp (one element near 1, the rest near 0), the read resembles a discrete lookup. If blurry, it blends multiple locations.
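
To make this concrete, here is a minimal NumPy sketch of a read (the memory size, its random contents, and the example weightings are illustrative assumptions, not values from the paper):

```python
import numpy as np

N, M = 128, 20                      # memory rows and row width (illustrative)
memory = np.random.randn(N, M)      # stands in for M_t

def read(memory, w):
    """Weighted read: r_t = sum_i w_t(i) * M_t(i)."""
    assert np.isclose(w.sum(), 1.0) and (w >= 0).all()
    return w @ memory               # (N,) @ (N, M) -> (M,)

# A sharp weighting behaves like a discrete lookup of row 3 ...
w_sharp = np.zeros(N)
w_sharp[3] = 1.0
assert np.allclose(read(memory, w_sharp), memory[3])

# ... while a blurry weighting blends rows 3 and 4.
w_blurry = np.zeros(N)
w_blurry[3] = w_blurry[4] = 0.5
blend = read(memory, w_blurry)      # average of the two rows
```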


Writing to Memory

Writing is a two-phase process: erase then add, inspired by the input and forget gates of an LSTM.

Given a write weighting \( w_t \), erase vector \( \mathbf{e}_t \) (\(M\) values between 0 and 1), and add vector \( \mathbf{a}_t \):

  1. Erase:

    \[ \tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i) \left[ \mathbf{1} - w_t(i) \mathbf{e}_t \right] \]

    A memory component is driven toward zero only where both the location weighting and the corresponding erase value are close to one.

  2. Add:

    \[ \mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i) \mathbf{a}_t \]

This fine-grained, continuous control allows selective updates while remaining differentiable.
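
The same erase-then-add update as a short NumPy sketch (the sizes and the one-hot example weighting are assumptions for illustration):

```python
import numpy as np

def write(memory, w, e, a):
    """Erase-then-add write.
    memory: (N, M); w: (N,) weighting; e, a: (M,) erase/add vectors,
    with each component of e in [0, 1] (a sigmoid output in the model).
    """
    memory = memory * (1.0 - np.outer(w, e))   # erase: M~(i) = M(i)[1 - w(i)e]
    return memory + np.outer(w, a)             # add:   M(i) = M~(i) + w(i)a

# With a one-hot weighting and e = 1 everywhere, row 3 is fully
# overwritten by the add vector; every other row is untouched.
N, M = 128, 20
memory = np.random.randn(N, M)
w = np.zeros(N)
w[3] = 1.0
new_memory = write(memory, w, e=np.ones(M), a=np.arange(M, dtype=float))
assert np.allclose(new_memory[3], np.arange(M))
```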


Differentiable Addressing: The Secret Weapon

The big question: how does the controller produce the weighting vector \( w_t \)?

The answer: by combining content-based addressing and location-based addressing.

Flow diagram showing the addressing mechanism, from content addressing to interpolation, convolutional shift, and sharpening.


1. Focusing by Content

Content-based addressing searches memory by similarity. The controller emits a key vector \( \mathbf{k}_t \) and a key strength \( \beta_t \).

Similarity between the key and each memory row is measured with cosine similarity:

\[ K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \, ||\mathbf{v}||} \]

Weights are computed as:

\[ w_t^{c}(i) = \frac{\exp\left(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(i)]\right)}{\sum_j \exp\left(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(j)]\right)} \]
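
A NumPy sketch of content addressing (the epsilon guard and the softmax stabilization are implementation details I've assumed, not part of the paper's definition):

```python
import numpy as np

def content_weights(memory, key, beta):
    """w^c = softmax(beta * cosine_similarity(key, each memory row))."""
    eps = 1e-8  # numerical guard against zero norms (an assumption)
    sim = memory @ key / (np.linalg.norm(memory, axis=1)
                          * np.linalg.norm(key) + eps)
    z = np.exp(beta * sim - np.max(beta * sim))   # stable softmax
    return z / z.sum()

# A larger key strength beta concentrates the weighting on the best match.
N, M = 128, 20
memory = np.random.randn(N, M)
key = memory[7] + 0.1 * np.random.randn(M)        # noisy copy of row 7
w = content_weights(memory, key, beta=20.0)
assert w.argmax() == 7
```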

2. Focusing by Location

Sometimes sequential traversal is needed—location-based addressing enables iteration.

First, interpolate between the new content address \( w_t^c \) and the previous weighting \( w_{t-1} \) via a gate \( g_t \):

\[ \mathbf{w}_t^{g} = g_t \mathbf{w}_t^{c} + (1 - g_t) \mathbf{w}_{t-1} \]

Then apply a rotational shift, via a shift weighting \( s_t \) over allowed integer shifts:

\[ \tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^{g}(j) \, s_t\big((i - j) \bmod N\big) \]

Finally, sharpen with factor \( \gamma_t \ge 1 \):

\[ w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \]
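
The full location-addressing pipeline fits in a few lines of NumPy. This sketch assumes the shift distribution covers only the shifts {-1, 0, +1}, one reasonable choice rather than a requirement of the architecture:

```python
import numpy as np

def location_weights(w_c, w_prev, g, s, gamma):
    """Interpolate, circularly shift, then sharpen."""
    # 1. Interpolation gate between content and previous weightings
    w_g = g * w_c + (1.0 - g) * w_prev
    # 2. Circular convolution; s = [s(-1), s(0), s(+1)], and np.roll
    #    implements the index arithmetic modulo N
    w_tilde = sum(s_k * np.roll(w_g, k) for k, s_k in zip((-1, 0, 1), s))
    # 3. Sharpening with gamma >= 1
    w = w_tilde ** gamma
    return w / w.sum()

# Pure iteration: ignore content (g = 0) and move the previous focus
# forward by one location.
N = 8
w_prev = np.zeros(N)
w_prev[2] = 1.0
w = location_weights(np.full(N, 1.0 / N), w_prev,
                     g=0.0, s=(0.0, 0.0, 1.0), gamma=1.0)
assert w.argmax() == 3
```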

This combined mechanism gives a head three complementary modes of operation:

  • Jump directly to a location whose content matches the key.
  • Jump to a content match, then shift to a nearby location.
  • Iterate from the previous weighting without any content lookup.

How Well Does It Work?

The Copy Task

The “Hello, World” of memory-augmented neural networks: the model is shown a sequence of random binary vectors, then must reproduce it from memory after a delay.

Learning curves for the Copy task: NTMs learn faster and achieve lower error than LSTM.

Trained on sequences of up to length 20, NTMs generalized to lengths 30, 50, and even 120.

NTM generalization on Copy: accurate reproduction far beyond training lengths. LSTM generalization on Copy: rapid degradation beyond length 20.

Inspecting memory accesses reveals the learned algorithm (sketched in code below):

  1. Write loop: start location → write input → shift head forward.
  2. Read loop: reset to start → read output → shift forward.

Visualization of NTM memory usage during Copy: sequential read/write heads.
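
Written out as ordinary code, the learned procedure amounts to something like the following sketch. This is an interpretation of the memory-access plots, not the NTM's actual machinery: in the real network, the pointer is an attention weighting moved by the shift mechanism, not a Python integer.

```python
import numpy as np

def copy_algorithm(sequence, memory, start=0):
    """The procedure the NTM appears to learn on Copy,
    with an explicit integer pointer for clarity."""
    ptr = start
    for x in sequence:                  # write loop
        memory[ptr] = x                 # write the input vector
        ptr += 1                        # shift the write head forward
    ptr = start                         # jump back to the start
    out = []
    for _ in sequence:                  # read loop
        out.append(memory[ptr].copy())  # emit the stored vector
        ptr += 1                        # shift the read head forward
    return out

memory = np.zeros((128, 8))
seq = [np.random.randint(0, 2, 8).astype(float) for _ in range(20)]
assert all((a == b).all() for a, b in zip(copy_algorithm(seq, memory), seq))
```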


Repeat Copy

Input: sequence + number of repetitions → output: that sequence repeated.

Repeat Copy learning curves: NTMs learn faster than LSTM.

NTM generalizes to longer sequences and more repeats than seen in training.

NTM vs LSTM generalization on Repeat Copy.

Memory usage shows a nested loop: one write pass, multiple read passes, with head jumps back to start—effectively a goto.

NTM memory usage during Repeat Copy: multiple passes over stored data.


Associative Recall

The network sees a sequence of items; then, given a query item, it must return the item that followed it.

Associative Recall learning curves: NTM (red, green) far ahead of LSTM (blue). Generalization: NTMs handle much longer item sequences.

Learned strategy: compress each item into a single vector and write these vectors to consecutive locations; on query, recompute the same compression, find it in memory via content lookup, and shift by one to read the next item.
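
Under the assumption that the compressed item codes occupy consecutive memory rows, the retrieval step can be sketched as a content lookup followed by a one-step shift. Here `query_code` stands in for the controller's learned compression, which is not shown:

```python
import numpy as np

def recall_next(memory, query_code, beta=50.0):
    """Content lookup for the query's code, then a +1 shift to read
    the vector stored immediately after it."""
    sim = memory @ query_code / (np.linalg.norm(memory, axis=1)
                                 * np.linalg.norm(query_code) + 1e-8)
    w = np.exp(beta * sim)
    w /= w.sum()                 # content weighting, peaked at the query
    w_next = np.roll(w, 1)       # shift the focus one location forward
    return w_next @ memory       # read the following item's code

# With item codes in consecutive rows, querying item 4 returns item 5.
memory = np.random.randn(16, 20)
retrieved = recall_next(memory, memory[4])
assert np.allclose(retrieved, memory[5], atol=1e-2)
```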

Memory usage: content lookup + shift to next item.


Priority Sort

Input: a sequence of random vectors, each tagged with a scalar priority; output: the highest-priority vectors in sorted order.

Priority Sort learning curves: NTMs outperform LSTM significantly.

NTM mapped priorities to write addresses (high priority → higher index). Sorting became sequential reading from highest to lowest address.

Example Priority Sort input/target. Write addresses are approximately linear in priority; reads proceed sequentially.


Key Takeaways

The Neural Turing Machine was a landmark in neural network research:

  • Differentiable Memory: Soft attention for reading/writing enables end-to-end gradient training.
  • Hybrid Addressing: Combining content and location addressing supports diverse data structures and algorithms.
  • Algorithmic Generalization: NTMs generalize learned algorithms far beyond training scope.

The NTM opened the door to Memory-Augmented Neural Networks, inspiring successors like the Differentiable Neural Computer (DNC). It bridged the gap between symbolic, rule-based AI and sub-symbolic, learned representations—showing that, with the right architecture, a neural network can learn to think more like a computer.