In the history of artificial intelligence, a handful of key ideas have acted as catalysts—transforming the field and opening up entirely new possibilities. In 1986, one such breakthrough arrived in the form of “Learning representations by back-propagating errors” by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
This seminal paper introduced a powerful and elegant algorithm that taught neural networks how to learn from their mistakes: backpropagation.
Before this, training neural networks with multiple layers was a notoriously difficult problem. Researchers knew that deep networks should be more powerful than shallow ones, but they lacked a practical, general-purpose method to train them. Algorithms like the perceptron-convergence procedure could only handle linearly separable problems, while multilayer networks with hidden units had no effective way to adjust their weights—especially when the desired outputs of those hidden units were unknown.
This was the classic credit assignment problem: when a network makes an error, which of its many connections should be held responsible, and by how much should each be adjusted?
The backpropagation algorithm provided an answer. It offered a systematic way to assign credit (or blame) to every single connection, enabling deep networks to learn meaningful internal features. This was not just another training method—it was the key that unlocked the potential of deep learning.
The Architecture: Layered Feed-Forward Networks
The authors explain backpropagation in the context of the simplest deep network structure: a layered, feed-forward network.
Imagine a vertical stack:
- Input layer at the bottom.
- Any number of hidden layers in the middle.
- Output layer at the top.
Information flows upward only—no intra-layer connections and no feedback loops.
In the forward pass, data flows from inputs up to outputs, with each unit computing its activation based on signals from the layer below.
Step 1: Total Input to a Unit
For a unit \( j \), its total input \( x_j \) is the weighted sum of outputs \( y_i \) from units \( i \) in the preceding layer:

\[ x_j = \sum_i y_i w_{ji} \]
This is a simple linear combination. But linearity alone can’t model complex functions—so we apply a nonlinearity next.
Step 2: Apply Non-Linear Activation
The paper uses the sigmoid function, which maps any real input to the range (0, 1):

\[ y_j = \frac{1}{1 + e^{-x_j}} \]
This “squashing” function allows neurons to represent smooth, graded activations while introducing non-linear capabilities. The process repeats layer-by-layer, transforming the input vector into the network’s final prediction.
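To make the forward pass concrete, here is a minimal sketch in NumPy (not code from the paper; the layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    # The paper's squashing non-linearity: maps any real input to (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def forward(y0, weights, biases):
    """Forward pass through a layered feed-forward network.

    y0      : input vector (activations of the bottom layer)
    weights : list of weight matrices, weights[k][j, i] = w_ji
    biases  : list of bias vectors, one per non-input layer
    Returns the activations of every layer (needed later for backprop).
    """
    activations = [y0]
    for W, b in zip(weights, biases):
        x = W @ activations[-1] + b      # Step 1: total input x_j = sum_i y_i * w_ji
        activations.append(sigmoid(x))   # Step 2: non-linear activation y_j
    return activations

# Illustrative 4-3-1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
ys = forward(np.array([1.0, 0.0, 0.0, 1.0]), weights, biases)
```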
The Learning Goal: Minimizing the Error
Once the network produces an output, we need to measure how far it is from the desired target. The authors use the sum of squared errors across all output units and all training cases:

\[ E = \frac{1}{2} \sum_c \sum_j \left( y_{j,c} - d_{j,c} \right)^2 \]
Here:
- \( c \) indexes training cases.
- \( j \) indexes output units.
- \( y_{j,c} \) is the actual output of unit \( j \) on case \( c \).
- \( d_{j,c} \) is the desired output.
The aim is to adjust the weights so that \( E \) becomes as small as possible.
To find the best weights, the authors employ gradient descent—iteratively moving the weights in the opposite direction of the error gradient to reach a (hopefully global) minimum.
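As a quick sketch, the error for a batch of cases can be computed directly from the formula above (the values here are made up for illustration):

```python
import numpy as np

def total_error(outputs, targets):
    # E = 1/2 * sum over cases c and output units j of (y_jc - d_jc)^2
    return 0.5 * np.sum((outputs - targets) ** 2)

outputs = np.array([[0.9, 0.2], [0.1, 0.7]])  # rows = cases, cols = output units
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
print(total_error(outputs, targets))
```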
The Core Method: Backward Error Propagation
While the forward pass computes activations, the backward pass computes error signals for each unit, layer-by-layer, starting from the output layer. This is achieved by repeated application of the chain rule of calculus.
Step 1: Error Derivative for Output Activations
Differentiating the error function with respect to the output \( y_j \) of an output unit (the factor of \( \tfrac{1}{2} \) cancels neatly):

\[ \frac{\partial E}{\partial y_j} = y_j - d_j \]
This result is intuitive—the output error is just “prediction minus target.”
Step 2: Error with Respect to Total Input
We use the chain rule to find \( \partial E / \partial x_j \):

\[ \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \cdot \frac{dy_j}{dx_j} \]

The derivative of the sigmoid is \( y_j(1 - y_j) \), yielding:

\[ \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \, y_j (1 - y_j) \]
This value, often called the delta (\(\delta_j\)), is the error signal at the input of the unit.
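In code, these two steps are one line each. A tiny sketch with made-up values:

```python
import numpy as np

y = np.array([0.73, 0.12])   # actual outputs (illustrative values)
d = np.array([1.0, 0.0])     # desired outputs

dE_dy = y - d                # Step 1: dE/dy_j = y_j - d_j
delta = dE_dy * y * (1 - y)  # Step 2: dE/dx_j, via the sigmoid derivative y(1 - y)
```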
Step 3: Gradient for Each Weight
The gradient for a weight \( w_{ji} \) follows from the chain rule again, since \( x_j \) depends linearly on \( w_{ji} \):

\[ \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \cdot y_i \]
A connection changes more if:
- The sending unit is highly active.
- The receiving unit has a large error signal.
Step 4: Backpropagating Error to Hidden Layers
For a hidden unit \( i \), we don't have a target output. Instead, its error is accumulated from the units \( j \) it connects to in the next layer, weighted by those connections:

\[ \frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j} \, w_{ji} \]
Repeating Steps 2 and 3 allows the algorithm to propagate deltas backward layer-by-layer until reaching the input layer.
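Here is a sketch of Steps 3 and 4 for a single layer: the weight gradients form an outer product, and the error signal for the layer below flows back through the same weights (the shapes and values are illustrative):

```python
import numpy as np

# Illustrative shapes: 3 hidden units feeding 2 output units.
W = np.array([[0.5, -0.3, 0.8],
              [0.2, 0.7, -0.6]])      # W[j, i] = w_ji
y_hidden = np.array([0.4, 0.9, 0.1])  # outputs of the hidden layer
delta_out = np.array([-0.05, 0.02])   # dE/dx_j at the output layer

# Step 3: gradient for each weight, dE/dw_ji = (dE/dx_j) * y_i.
grad_W = np.outer(delta_out, y_hidden)

# Step 4: error for the hidden outputs, dE/dy_i = sum_j (dE/dx_j) * w_ji.
dE_dy_hidden = W.T @ delta_out

# Repeating Step 2 converts this into the hidden layer's deltas.
delta_hidden = dE_dy_hidden * y_hidden * (1 - y_hidden)
```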
Step 5: Updating Weights
Once all gradients \( \partial E / \partial w \) are computed, weights are updated by gradient descent:

\[ \Delta w = -\varepsilon \frac{\partial E}{\partial w} \]
Here, \( \varepsilon \) is the learning rate. The authors improve on this with a momentum term that blends in the previous weight change:

\[ \Delta w(t) = -\varepsilon \frac{\partial E}{\partial w}(t) + \alpha \, \Delta w(t-1) \]
Momentum (\(\alpha\)) helps accelerate learning along consistent gradient directions and damp out oscillations.
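A minimal sketch of the update rule with momentum (the \( \varepsilon \) and \( \alpha \) values are illustrative, not the paper's):

```python
import numpy as np

def update(W, grad_W, velocity, eps=0.1, alpha=0.9):
    # Delta w(t) = -eps * dE/dw(t) + alpha * Delta w(t-1)
    velocity = -eps * grad_W + alpha * velocity
    return W + velocity, velocity

W = np.zeros((2, 3))
velocity = np.zeros_like(W)        # Delta w(t-1), initially zero
grad_W = np.ones((2, 3)) * 0.01    # gradients from the backward pass
W, velocity = update(W, grad_W, velocity)
```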
Experiment 1: Detecting Symmetry
One test for backpropagation was the symmetry detection problem: determine whether a binary input pattern is symmetric about its center.
In the trained network, each of two hidden units gave mirror-image input positions weights of equal magnitude but opposite sign. Symmetric inputs therefore produce exactly zero net input to both hidden units, keeping them off, and the output unit turns on; asymmetric patterns drive a hidden unit active, which suppresses the output. The algorithm had discovered an abstract, higher-order feature of the input rather than memorizing specific patterns.
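The key trick can be checked in a few lines. Below is a minimal sketch (not the paper's exact weights; the magnitudes follow a 1:2:4 pattern so that no asymmetric binary pattern can cancel to zero):

```python
import numpy as np

# One hidden unit's weights: equal magnitude, opposite sign for
# mirror-image positions; magnitudes 1:2:4 guarantee that only
# symmetric patterns sum to exactly zero.
w = np.array([1.0, 2.0, 4.0, -4.0, -2.0, -1.0])

symmetric = np.array([1, 0, 1, 1, 0, 1])    # mirror-symmetric pattern
asymmetric = np.array([1, 1, 0, 1, 0, 1])   # not symmetric

print(w @ symmetric)    # 0.0 -> hidden unit stays off, output unit turns on
print(w @ asymmetric)   # nonzero -> hidden unit fires, suppressing the output
```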
Experiment 2: Learning Family Relationships
The authors also demonstrated backpropagation on symbolic data—two isomorphic family trees, one English and one Italian.
The network was given the first two elements of a triple (e.g., <Colin> <has-aunt>) and had to output the correct third element.
Inspecting the hidden units' receptive fields revealed distributed representations, with individual units capturing interpretable features:
- Unit 1: English vs. Italian distinction.
- Unit 2: Generation within the tree.
- Unit 6: Family branch.
This is representation learning in action: the network distilled raw input features into abstract concepts, enabling generalization.
Extension to Recurrent Networks
Backpropagation also extends to recurrent networks by “unfolding” them in time into equivalent feed-forward layers with shared weights.
This allows sequential models to be trained with the same machinery; the one extra constraint is that corresponding weights in different time steps stay identical, which is enforced by summing their gradients across steps.
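A minimal sketch of the unfolding idea, assuming a single recurrent weight matrix reused at each time step (all shapes and values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One recurrent weight matrix, reused at every time step ("shared weights").
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))

inputs = [rng.normal(size=3) for _ in range(4)]   # a short input sequence
h = np.zeros(3)
states = [h]
for u in inputs:                  # the forward pass is now a 4-layer
    h = sigmoid(W @ h + u)        # feed-forward net with tied weights
    states.append(h)

# During backprop through the unfolded net, each time step yields its own
# gradient for W; the shared-weight constraint means we simply add them:
# grad_W_total = sum over t of grad_W(t).
```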
Conclusion and Lasting Impact
Rumelhart, Hinton, and Williams acknowledged that backpropagation's error surface may contain local minima, but found that in practice these rarely trap learning seriously, especially when a network has more connections than the minimum required for the task.
While not biologically realistic, the algorithm demonstrated that gradient descent in weight space can produce rich, internal feature representations. This laid the foundation for decades of research and modern deep learning.
The 1986 paper solved the credit assignment problem in a practical way and showed that networks could autonomously build useful abstractions from data. Every neural model today—from image classifiers to large language models—stands on this conceptual groundwork.
Backpropagation remains a beautiful example of how a clever application of calculus can teach a machine to learn. Without it, the deep learning revolution would likely have been delayed by decades.