In the history of artificial intelligence, a handful of key ideas have acted as catalysts—transforming the field and opening up entirely new possibilities. In 1986, one such breakthrough arrived in the form of “Learning representations by back-propagating errors” by David Rumelhart, Geoffrey Hinton, and Ronald Williams.

This seminal paper introduced a powerful and elegant algorithm that taught neural networks how to learn from their mistakes: backpropagation.

Before this, training neural networks with multiple layers was a notoriously difficult problem. Researchers knew that deep networks should be more powerful than shallow ones, but they lacked a practical, general-purpose method to train them. Algorithms like the perceptron-convergence procedure could only handle linearly separable problems, while multilayer networks with hidden units had no effective way to adjust their weights—especially when the desired outputs of those hidden units were unknown.

This was the classic credit assignment problem: when a network makes an error, which of its many connections should be held responsible, and by how much should each be adjusted?

The backpropagation algorithm provided an answer. It offered a systematic way to assign credit (or blame) to every single connection, enabling deep networks to learn meaningful internal features. This was not just another training method—it was the key that unlocked the potential of deep learning.


The Architecture: Layered Feed-Forward Networks

The authors explain backpropagation in the context of the simplest deep network structure: a layered, feed-forward network.

Imagine a vertical stack:

  • Input layer at the bottom.
  • Any number of hidden layers in the middle.
  • Output layer at the top.

Information flows upward only—no intra-layer connections and no feedback loops.

In the forward pass, data flows from inputs up to outputs, with each unit computing its activation based on signals from the layer below.

Step 1: Total Input to a Unit

For a unit \( j \), its total input \( x_j \) is the weighted sum of outputs \( y_i \) from units \( i \) in the preceding layer:

\[ x_j = \sum_i y_i \, w_{ji} \]

This is a simple linear combination. But linearity alone can’t model complex functions—so we apply a nonlinearity next.
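As a quick sanity check, here is the weighted sum in a few lines of Python (a toy sketch; the numbers and variable names are made up):

```python
import numpy as np

# Toy values: unit j receives input from three units in the layer below.
y_below = np.array([0.2, 0.9, 0.5])   # outputs y_i of the preceding layer
w_j = np.array([0.4, -0.6, 1.1])      # weights w_ji into unit j

x_j = np.dot(w_j, y_below)            # x_j = sum_i y_i * w_ji
print(x_j)                            # 0.08 - 0.54 + 0.55 = 0.09
```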

Step 2: Apply Non-Linear Activation

The paper uses the sigmoid function, which maps any real input to the range (0, 1):

\[ y_j = \frac{1}{1 + e^{-x_j}} \]

This “squashing” function allows neurons to represent smooth, graded activations while introducing non-linear capabilities. The process repeats layer-by-layer, transforming the input vector into the network’s final prediction.
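Putting the two steps together, a minimal forward pass for a whole stack of layers might look like the sketch below (the function names, shapes, and the omission of bias terms are my simplifications, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    # The paper's squashing non-linearity: maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, weights):
    """Layered feed-forward pass. weights[k][j, i] connects unit i in
    layer k to unit j in layer k+1. Biases are omitted for brevity."""
    activations = [inputs]
    for W in weights:
        x = W @ activations[-1]          # Step 1: total input to each unit
        activations.append(sigmoid(x))   # Step 2: non-linear activation
    return activations  # keep every layer's output; the backward pass needs it

# Tiny example: 3 inputs -> 2 hidden units -> 1 output, random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
acts = forward(np.array([1.0, 0.0, 1.0]), weights)
print(acts[-1])  # the network's prediction
```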


The Learning Goal: Minimizing the Error

Once the network produces an output, we need to measure how far it is from the desired target. The authors use the sum of squared errors across all output units and all training examples:

\[ E = \frac{1}{2} \sum_c \sum_j \left( y_{j,c} - d_{j,c} \right)^2 \]

Here:

  • \( c \) indexes training cases.
  • \( j \) indexes output units.
  • \( y_j \) is the actual output.
  • \( d_j \) is the desired output.

The aim is to adjust the weights so that \( E \) becomes as small as possible.
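In code, the error measure is a one-liner (a sketch; the array names and example values are mine):

```python
import numpy as np

def total_error(y, d):
    # E = 1/2 * sum over cases c and output units j of (y - d)^2
    # y and d have shape (num_cases, num_output_units).
    return 0.5 * np.sum((y - d) ** 2)

y = np.array([[0.8, 0.3], [0.1, 0.9]])  # actual outputs for two cases
d = np.array([[1.0, 0.0], [0.0, 1.0]])  # desired outputs
print(total_error(y, d))  # 0.5 * (0.04 + 0.09 + 0.01 + 0.01) = 0.075
```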

To find the best weights, the authors employ gradient descent—iteratively moving the weights in the opposite direction of the error gradient to reach a (hopefully global) minimum.


The Core Method: Backward Error Propagation

While the forward pass computes activations, the backward pass computes error signals for each unit, layer-by-layer, starting from the output layer. This is achieved via the chain rule in calculus.

Step 1: Error Derivative for Output Activations

Differentiating the error function with respect to the output \( y_j \):

\[ \frac{\partial E}{\partial y_j} = y_j - d_j \]

This result is intuitive—the output error is just “prediction minus target.”

Step 2: Error with Respect to Total Input

We use the chain rule to find \(\partial E / \partial x_j\):

\[ \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \cdot \frac{\mathrm{d} y_j}{\mathrm{d} x_j} \]

The derivative of the sigmoid is \( y_j(1 - y_j) \), yielding:

\[ \frac{\partial E}{\partial x_j} = (y_j - d_j) \, y_j (1 - y_j) \]

This value, often called the delta (\(\delta_j\)), is the error signal at the input of the unit.

Step 3: Gradient for Each Weight

The gradient for a weight \( w_{ji} \):

\[ \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \, y_i \]

A weight therefore changes more when:

  1. The sending unit is highly active.
  2. The receiving unit has a large error signal.

Step 4: Backpropagating Error to Hidden Layers

For a hidden unit \( i \), we don’t have a target output. Instead, its error is the sum of errors from units it connects to in the next layer, weighted by those connections:

\[ \frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j} \, w_{ji} \]

Repeating Steps 2–4 layer-by-layer propagates the deltas backward through the network until a gradient has been computed for every weight.
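All four steps translate almost line-for-line into code. Below is a hedged sketch of one backward pass, written to pair with the `forward` function from the earlier sketch (the names are mine, not the paper's):

```python
import numpy as np

def backward(activations, weights, target):
    """Compute dE/dW for every layer by backpropagation.

    activations: list of layer outputs from forward() (inputs first).
    weights: list of weight matrices, as in forward().
    target: desired output vector d for this training case.
    Returns one gradient matrix per weight matrix.
    """
    grads = [None] * len(weights)
    dE_dy = activations[-1] - target                 # Step 1: dE/dy_j = y_j - d_j
    for k in reversed(range(len(weights))):
        y = activations[k + 1]
        delta = dE_dy * y * (1.0 - y)                # Step 2: sigmoid' = y(1 - y)
        grads[k] = np.outer(delta, activations[k])   # Step 3: dE/dw_ji = delta_j * y_i
        dE_dy = weights[k].T @ delta                 # Step 4: error for layer below
    return grads
```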

Step 5: Updating Weights

Once all gradients \(\partial E / \partial w\) are computed, weights are updated using gradient descent:

\[ \Delta w = -\varepsilon \, \frac{\partial E}{\partial w} \]

Here, \(\varepsilon\) is the learning rate. The authors enhance this with a momentum term:

\[ \Delta w(t) = -\varepsilon \, \frac{\partial E}{\partial w}(t) + \alpha \, \Delta w(t-1) \]

Momentum (\(\alpha\)) helps accelerate learning and smooth oscillations.
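A sketch of the update rule with momentum, applied in place to the weight and gradient lists from the earlier sketches (the ε and α defaults here are arbitrary illustrations, not the paper's values):

```python
import numpy as np

def update_weights(weights, grads, velocities, eps=0.1, alpha=0.9):
    """One gradient-descent step with momentum, in place:
    delta_w(t) = -eps * dE/dw(t) + alpha * delta_w(t-1)
    """
    for W, g, v in zip(weights, grads, velocities):
        v *= alpha        # alpha * previous weight change
        v -= eps * g      # minus eps times the current gradient
        W += v            # apply the accumulated change

# Initialize the velocities to zero before the first update:
# velocities = [np.zeros_like(W) for W in weights]
```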


Experiment 1: Detecting Symmetry

One test for backpropagation was the symmetry detection problem: determine whether a binary input pattern is symmetric about its center.

Figure 1: A network trained with backpropagation to detect mirror symmetry. The weights and biases reveal a clever strategy where symmetric patterns cancel each other out.

In the trained network, the two hidden units received weights that were equal in magnitude but opposite in sign for mirror-image input positions. A symmetric pattern therefore produces zero net input to both hidden units, keeping them off, which lets the output unit turn on; a non-symmetric pattern activates at least one hidden unit, suppressing the output. The algorithm had discovered an abstract, higher-order feature of the input on its own.
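To make the task concrete, here is a sketch of how a symmetry training set might be generated; the pattern length of six is an assumption for illustration, not a value taken from the text above:

```python
import itertools
import numpy as np

def is_symmetric(pattern):
    # A pattern is symmetric if it reads the same forwards and backwards.
    return pattern == pattern[::-1]

def symmetry_dataset(n_bits=6):  # n_bits=6 is an assumed pattern length
    """All binary patterns of length n_bits, labeled 1 if mirror-symmetric."""
    X, y = [], []
    for bits in itertools.product([0, 1], repeat=n_bits):
        X.append(bits)
        y.append(1.0 if is_symmetric(bits) else 0.0)
    return np.array(X, dtype=float), np.array(y)

X, y = symmetry_dataset()
print(X.shape, int(y.sum()))  # (64, 6) patterns, 8 of them symmetric
```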


Experiment 2: Learning Family Relationships

The authors also demonstrated backpropagation on symbolic data—two isomorphic family trees, one English and one Italian.

Figure 2: Two isomorphic family trees used for the training data. The network’s task was to learn relationships like “Colin has-aunt ?”.

The network was given the first two elements of a triple (e.g., <Colin> <has-aunt>) and had to output the correct third element.
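As an illustration of how such symbolic triples can be fed to a network, here is a hedged sketch of a one-hot input encoding; the vocabulary below is hypothetical and heavily truncated, with only "Colin" and "has-aunt" taken from the text:

```python
import numpy as np

# Hypothetical, truncated vocabulary for illustration only.
people = ["Colin", "Charlotte", "Victoria", "James"]
relations = ["has-aunt", "has-father", "has-mother"]

def encode(person, relation):
    """One-hot encode the first two elements of a (person, relation, person) triple."""
    p = np.zeros(len(people)); p[people.index(person)] = 1.0
    r = np.zeros(len(relations)); r[relations.index(relation)] = 1.0
    return np.concatenate([p, r])

print(encode("Colin", "has-aunt"))  # [1. 0. 0. 0. 1. 0. 0.]
```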

Figure 3: The five-layer network architecture used for the family tree task. It learns distributed representations of people and relationships in its hidden layers.

Inspecting the hidden units’ receptive fields revealed distributed representations:

Figure 4: A visualization of the weights from the input units (people) to six hidden units. The patterns show that the units have learned to represent abstract features.

The hidden units captured:

  • Unit 1: English vs. Italian distinction.
  • Unit 2: Generation within the tree.
  • Unit 6: Family branch.

This is representation learning in action: the network distilled raw input features into abstract concepts, enabling generalization.


Extension to Recurrent Networks

Backpropagation also extends to recurrent networks by “unfolding” them in time into equivalent feed-forward layers with shared weights.

Figure 5: The equivalence between a recurrent network run for three time steps and a three-layer feed-forward network with shared weights.

This allowed training sequential models using the same principles.
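A minimal sketch of the unfolding idea, assuming a simple recurrent update \( h_t = \sigma(W h_{t-1} + U x_t) \) (this specific update form is my assumption; the paper's construction is more general):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unrolled_forward(x_seq, W, U, h0):
    """Run a recurrent net for len(x_seq) steps.

    Because W and U are reused at every step, this is equivalent to a
    feed-forward network with one layer per time step and shared weights.
    """
    h = h0
    states = [h0]
    for x_t in x_seq:
        h = sigmoid(W @ h + U @ x_t)  # one "layer" of the unrolled network
        states.append(h)
    return states  # backprop-through-time sums gradients across the copies

rng = np.random.default_rng(1)
W, U = rng.standard_normal((3, 3)), rng.standard_normal((3, 2))
x_seq = [rng.standard_normal(2) for _ in range(3)]  # three time steps
states = unrolled_forward(x_seq, W, U, np.zeros(3))
print(len(states))  # initial state plus three unrolled layers
```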


Conclusion and Lasting Impact

Rumelhart, Hinton, and Williams acknowledged that backpropagation's error surface can contain local minima, but reported that in practice the procedure rarely gets stuck in ones poor enough to matter, especially when a network has more connections than the minimum needed for the task: the extra weights provide additional dimensions in weight space through which gradient descent can escape.

While not biologically realistic, the algorithm demonstrated that gradient descent in weight space can produce rich, internal feature representations. This laid the foundation for decades of research and modern deep learning.

The 1986 paper solved the credit assignment problem in a practical way and showed that networks could autonomously build useful abstractions from data. Every neural model today—from image classifiers to large language models—stands on this conceptual groundwork.

Backpropagation remains a beautiful example of how a clever application of calculus can teach a machine to learn. Without it, the deep learning revolution would likely have been delayed by decades.