How does the human mind handle time? It’s a question that feels both simple and impossibly complex. So much of what we do—from understanding a melody to catching a ball to having a conversation—depends on processing sequences of events as they unfold.
Language, in particular, is a river of information flowing through time. The meaning of a sentence isn’t just in the words themselves, but in their order. “Dog bites man” is ordinary news; “Man bites dog” is a headline.
For early artificial intelligence and neural network researchers, this problem of serial order was a major hurdle. Most models were built to look at a static piece of data—like a picture—and classify it. But how do you give a network a sense of history? How can it understand that what it’s seeing now is connected to what it saw a moment ago?
A common early approach was to cheat: represent time as space. Imagine taking a movie and laying out every single frame side-by-side, then trying to understand the plot by looking at the giant mosaic all at once. This “spatial representation” can work for short, fixed-length sequences, but it’s clumsy and brittle. It requires a huge input layer, can’t handle sequences of varying length (like sentences), and struggles to recognize the same pattern if it’s simply shifted in time.
In 1990, Jeffrey L. Elman published a landmark paper in Cognitive Science titled Finding Structure in Time. He introduced a beautifully simple architectural tweak that gave a standard neural network a form of memory. This network—now famously known as a Simple Recurrent Network (SRN) or an Elman Network—wasn’t just able to process sequences; it was able to discover hidden structures within them. By simply learning to predict the next word in a sentence, Elman’s network taught itself the building blocks of grammar, discovering concepts like “noun” and “verb” from scratch.
This article will take you on a deep dive into this classic work. We’ll explore the elegant idea behind the SRN, walk through the series of clever experiments Elman used to test its limits, and uncover how this research laid the foundation for the sequential models—such as LSTMs and Transformers—that dominate AI today.
A Network with a Memory
Before Elman, some researchers were already exploring recurrence. In 1986, Michael Jordan proposed a network architecture where the output of the network was fed back into a special set of “state” units on the next time step.
Figure 1: The architecture proposed by Jordan (1986). The output from one time step is used as context for the next. This was a key precursor to Elman’s model.
This feedback allowed the network’s next action to be influenced by its prior actions—a crucial step for tasks like motor control.
Elman made a subtle but profound change to this idea: instead of feeding back the network’s output (its visible action), he fed back its own internal state (its “thought”).
The Simple Recurrent Network (SRN), shown below, is almost identical to a standard feed-forward network, with one addition: context units.
Figure 2: The SRN. The key innovation is the feedback loop where the hidden layer’s activation at time t is copied to the context units, becoming part of the input at time t+1.
Here’s how it works:
- At time \(t\), an input (e.g., a vector representing a word) is presented to the input units.
- The input units, along with the context units, activate the hidden units. The hidden layer forms the network’s internal representation of the input.
- The hidden units activate the output units to produce a result.
- Before the next cycle begins, the hidden unit activation pattern is copied back—one-for-one—to the context units.
The context units act as the network’s short-term memory. At any moment, the hidden layer sees not only the current input but also a snapshot of its own previous activation. This means the network’s internal representation of an input is always shaped by what came before. Time isn’t an external dimension to be managed—it’s an integral part of the processing dynamics.
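To make the mechanism concrete, here is a minimal NumPy sketch of a single SRN step (the layer sizes, weight names, and initialization are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 20, 6          # illustrative sizes

W_ih = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input   -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_ho = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x, context):
    """One time step: the hidden layer sees the current input plus the previous hidden state."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden.copy()   # the copy becomes the context for the next step

context = np.zeros(n_hidden)       # context starts empty
for x in np.eye(n_in):             # feed a toy sequence of one-hot inputs
    y, context = srn_step(x, context)
```

The only difference from a plain feed-forward pass is the second argument to `srn_step` and the copy handed back at the end.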
To train the network, Elman used a powerful learning signal: prediction. At every time step, the network was asked to predict the next item in the sequence. To predict tomorrow, you have to understand today.
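In code terms, the sequence supplies its own targets: the target at time \(t\) is simply whatever arrives at time \(t+1\). Continuing the sketch above (reusing `srn_step`, `rng`, and the layer sizes defined there):

```python
# A toy stream of one-hot "symbols"; the target for each input is the next input.
sequence = np.eye(n_in)[rng.integers(0, n_in, size=50)]
inputs, targets = sequence[:-1], sequence[1:]

context = np.zeros(n_hidden)
errors = []
for x, target in zip(inputs, targets):
    y, context = srn_step(x, context)
    errors.append(np.mean((target - y) ** 2))   # per-step prediction error

# In the paper the weights are adjusted by backpropagation to reduce this error;
# the per-step error profile is what Figures 3 through 6 plot.
```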
From Simple Rhythms to Hidden Rules
Elman tested his network on a series of tasks, each more complex than the last, to see if this simple memory mechanism could uncover increasingly abstract structures.
Experiment 1: The Temporal XOR
The Exclusive-OR (XOR) problem is a classic benchmark. The task: given two bits, output 1 if they’re different, 0 if they’re the same.
Elman created a temporal version, producing a continuous stream of bits where every third bit was the XOR of the previous two. For example:
1, 0 → 1
0, 0 → 0
1, 1 → 0
This formed a stream: 1, 0, 1, 0, 0, 0, 1, 1, 0, ... The network saw one bit at a time and had to predict the next.
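A short sketch of how such a stream can be generated (the stream length is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def xor_stream(n_triples=1000):
    """Concatenate triples (a, b, a XOR b) into one continuous bit stream."""
    bits = []
    for _ in range(n_triples):
        a, b = rng.integers(0, 2, size=2)
        bits.extend([int(a), int(b), int(a ^ b)])
    return np.array(bits)

stream = xor_stream()
# The first two bits of each triple are random; only every third bit is
# predictable from the two before it.
```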
Figure 3: Average prediction error for the temporal XOR task. The error drops at the moments where the next bit is predictable (every third bit), showing the network has learned the dependency.
The repeating pattern of high-high-low error reveals that the network learned the XOR rule: unpredictability for the first two bits, predictability when the third (XOR) bit arrives. Importantly, the SRN’s memory—via context units—was enough to solve a problem requiring a two-step look-back.
Experiment 2: Finding Structure in Artificial Words
Next, Elman created a mini-language of consonants (b, d, g) and vowels (a, i, u). Each “letter” was represented by a 6-bit feature vector.
Table 1: The artificial alphabet, with each letter defined by six phonetic-style binary features.
He defined simple rules:
b → ba
d → dii
g → guuu
A random sequence of consonants was expanded according to these rules, producing streams like diibaguuubadi...
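A sketch of the stream generator (letters stand in for the 6-bit feature vectors of Table 1, which would be looked up before feeding the network):

```python
import random

random.seed(0)
EXPANSIONS = {"b": "ba", "d": "dii", "g": "guuu"}   # the rewrite rules above

def letter_stream(n_consonants=1000):
    """Expand a random consonant sequence into a letter stream such as 'diibaguuuba...'."""
    consonants = random.choices(list(EXPANSIONS), k=n_consonants)
    return "".join(EXPANSIONS[c] for c in consonants)

stream = letter_stream()
```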
The SRN read the stream one letter at a time and predicted the next.
Figure 4: Prediction error over time. Spikes occur on unpredictable consonants; valleys occur during predictable vowel sequences.
The pattern makes sense: consonants occur randomly, vowels follow predictably. But the deep insight came from feature-wise error.
Figure 5(a): Error for the Consonantal feature (bit 1) stays low—after vowels, the network can reliably predict that a consonant is next.
Figure 5(b): Error for the High feature (bit 4) spikes when predicting consonants, since consonants differ in this trait.
This shows the SRN wasn’t just memorizing sequences—it learned abstract rules about features, making partial predictions based on structural regularities.
Experiment 3: Discovering “Words” from Continuous Text
In speech, words aren’t separated by spaces. To simulate this, Elman concatenated simple sentences into one long stream of letters, with no spaces or punctuation. The network’s task was to predict the next letter.
Figure 6: Prediction error in the letter-in-word task. Peaks often correspond to word boundaries, where predictability is lowest.
Error peaks aligned with word boundaries: unpredictability is highest at the start of a word, and predictability increases as the word unfolds. This suggests the error signal could serve as a cue for learning segmentation—finding “word” boundaries without ever being told what a word is.
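One way to operationalize that cue, as a rough sketch: mark a candidate boundary wherever the per-letter prediction error jumps relative to the letters around it (the error values would come from a trained network, as in Figure 6; the thresholding rule here is an assumption, not something from the paper):

```python
import numpy as np

def boundary_candidates(errors, threshold=None):
    """Flag positions where prediction error rises sharply: likely word onsets."""
    errors = np.asarray(errors, dtype=float)
    rises = np.diff(errors, prepend=errors[0])     # change in error at each letter
    if threshold is None:
        threshold = rises.mean() + rises.std()     # crude adaptive cutoff
    return np.where(rises > threshold)[0]

# boundaries = boundary_candidates(per_letter_errors)   # per_letter_errors from the trained SRN
```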
The Main Event: Grammar from Word Order
Finally, Elman tackled whole words. He created a simple grammar to generate thousands of short sentences from a 29-word vocabulary.
Table 3: Lexical categories used in the grammar. The network never saw these labels—it had to infer them.
Table 4: Sentence templates defining permissible word orders.
Each word was assigned a randomly chosen one-hot vector, with no built-in similarity: man and woman were, in their raw representations, as different from each other as rock and eat. Any grouping had to come from how the words appeared in sentences.
Trained to predict the next word in the concatenated sentence stream, the network developed internal hidden-unit representations for each word. Elman averaged these across contexts (to get type-level representations) and performed hierarchical clustering.
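A sketch of that analysis step, assuming we have gathered the hidden-layer vector produced at each word occurrence into a list of (word, vector) pairs (the variable names are assumptions; SciPy's standard hierarchical-clustering routines stand in for whatever clustering tool was used):

```python
import numpy as np
from collections import defaultdict
from scipy.cluster.hierarchy import linkage, dendrogram

def type_representations(hidden_states):
    """Average each word's hidden vectors across every context it appeared in."""
    buckets = defaultdict(list)
    for word, vec in hidden_states:          # hidden_states: list of (word, hidden_vector)
        buckets[word].append(vec)
    words = sorted(buckets)
    return words, np.array([np.mean(buckets[w], axis=0) for w in words])

# words, reps = type_representations(hidden_states)
# tree = linkage(reps, method="average")     # hierarchical clustering, as in Figure 7
# dendrogram(tree, labels=words)
```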
Figure 7: The network’s conceptual map. Words are grouped by grammatical and semantic role—discovered purely from prediction learning.
The tree’s structure was strikingly linguistic:
- First split: Nouns vs Verbs.
- Nouns split into Animates vs Inanimates; animates split into Humans vs Animals.
- Verbs split by syntactic behavior: transitive, intransitive, optional objects.
The SRN had inferred the correlates of grammar from distributional statistics alone.
Types, Tokens, and Context
Figure 7 showed types: each category was the average of many tokens seen in different contexts. But SRN representations are always entwined with their preceding context. This means boy in “boy chases…” differs from boy in “…chases boy.”
Figure 9: Individual tokens of BOY and GIRL cluster based on sentence position and preceding context.
Tokens cluster by similarity of situation—e.g., sentence-initial vs sentence-final—reflecting how context warps a word’s representation.
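The token-level analysis is the same pipeline minus the averaging: keep one vector per occurrence and cluster those directly (continuing the assumed `hidden_states` list from the sketch above):

```python
import numpy as np

def token_representations(hidden_states, word):
    """Collect every hidden vector produced for one word, one entry per occurrence."""
    return np.array([vec for w, vec in hidden_states if w == word])

# boy_tokens = token_representations(hidden_states, "boy")
# Clustering these token vectors groups them by context, e.g. sentence-initial
# versus sentence-final uses, as in Figure 9.
```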
To emphasize this, Elman tried a novel word: zog. He inserted zog in all positions where man could appear, and passed these sentences to the trained network (no retraining).
Figure 8: The novel word ZOG placed with the human nouns, next to MAN—based only on contextual usage.
The network classified zog as a human noun, consistent with its distribution in sentences—mirroring how humans infer meaning for new words from context.
Conclusion: The Legacy of a Simple Idea
Finding Structure in Time showed that:
- Time can be represented implicitly in a network’s internal state, without spatial sequencing hacks.
- Prediction is a potent training signal, forcing a network to capture temporal structure.
- Grammatical categories emerge from distributional patterns, without explicit rules.
- Representations can be both categorical and context-sensitive, naturally distinguishing types and tokens.
The SRN was the ancestor of modern recurrent architectures like LSTMs and GRUs. Today’s Transformers solve sequence problems differently, but still wrestle with the same core challenge: integrating context into each element’s representation.
Elman’s simple tweak gave neural networks an inner voice that could remember, anticipate, and abstract. It was a step toward teaching machines not just to see, but to understand the flow of events—the very essence of experience.