How does the human mind handle time? It’s a question that feels both simple and impossibly complex. So much of what we do—from understanding a melody to catching a ball to having a conversation—depends on processing sequences of events as they unfold.
Language, in particular, is a river of information flowing through time. The meaning of a sentence isn’t just in the words themselves, but in their order. “Dog bites man” is ordinary news; “Man bites dog” is a headline.
For early artificial intelligence and neural network researchers, this problem of serial order was a major hurdle. Most models were built to look at a static piece of data—like a picture—and classify it. But how do you give a network a sense of history? How can it understand that what it’s seeing now is connected to what it saw a moment ago?
A common early approach was to cheat: represent time as space. Imagine taking a movie and laying out every single frame side-by-side, then trying to understand the plot by looking at the giant mosaic all at once. This “spatial representation” can work for short, fixed-length sequences, but it’s clumsy and brittle. It requires a huge input layer, can’t handle sequences of varying length (like sentences), and struggles to recognize the same pattern if it’s simply shifted in time.
In 1990, Jeffrey L. Elman published a landmark paper in Cognitive Science titled Finding Structure in Time. He introduced a beautifully simple architectural tweak that gave a standard neural network a form of memory. This network—now famously known as a Simple Recurrent Network (SRN) or an Elman Network—wasn’t just able to process sequences; it was able to discover hidden structures within them. By simply learning to predict the next word in a sentence, Elman’s network taught itself the building blocks of grammar, discovering concepts like “noun” and “verb” from scratch.
This article will take you on a deep dive into this classic work. We’ll explore the elegant idea behind the SRN, walk through the series of clever experiments Elman used to test its limits, and uncover how this research laid the foundation for the sequential models—such as LSTMs and Transformers—that dominate AI today.
A Network with a Memory
Before Elman, some researchers were already exploring recurrence. In 1986, Michael Jordan proposed a network architecture where the output of the network was fed back into a special set of “state” units on the next time step.
Figure 1: The architecture proposed by Jordan (1986). The output from one time step is used as context for the next. This was a key precursor to Elman’s model.
This feedback allowed the network’s next action to be influenced by its prior actions—a crucial step for tasks like motor control.
Elman made a subtle but profound change to this idea: instead of feeding back the network’s output (its visible action), he fed back its own internal state (its “thought”).
The Simple Recurrent Network (SRN), shown below, is almost identical to a standard feed-forward network, with one addition: context units.
Figure 2: The SRN. The key innovation is the feedback loop where the hidden layer’s activation at time t is copied to the context units, becoming part of the input at time t+1.
Here’s how it works:
- At time \(t\), an input (e.g., a vector representing a word) is presented to the input units.
- The input units, along with the context units, activate the hidden units. The hidden layer forms the network’s internal representation of the input.
- The hidden units activate the output units to produce a result.
- Before the next cycle begins, the hidden unit activation pattern is copied back—one-for-one—to the context units.
The context units act as the network’s short-term memory. At any moment, the hidden layer sees not only the current input but also a snapshot of its own previous activation. This means the network’s internal representation of an input is always shaped by what came before. Time isn’t an external dimension to be managed—it’s an integral part of the processing dynamics.
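To make the mechanism concrete, here is a minimal NumPy sketch of a single SRN step (the layer sizes, weight names, and initialization are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 20, 6          # illustrative sizes

W_ih = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input   -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_ho = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x, context):
    """One time step: the hidden layer sees the current input plus the previous hidden state."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden.copy()   # the copy becomes the context for the next step

context = np.zeros(n_hidden)       # context starts empty
for x in np.eye(n_in):             # feed a toy sequence of one-hot inputs
    y, context = srn_step(x, context)
```

The only difference from a plain feed-forward pass is the second argument to `srn_step` and the copy handed back at the end.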
To train the network, Elman used a powerful learning signal: prediction. At every time step, the network was asked to predict the next item in the sequence. To predict tomorrow, you have to understand today.
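In code terms, the sequence supplies its own targets: the target at time \(t\) is simply whatever arrives at time \(t+1\). Continuing the sketch above (reusing `srn_step`, `rng`, and the layer sizes defined there):

```python
# A toy stream of one-hot "symbols"; the target for each input is the next input.
sequence = np.eye(n_in)[rng.integers(0, n_in, size=50)]
inputs, targets = sequence[:-1], sequence[1:]

context = np.zeros(n_hidden)
errors = []
for x, target in zip(inputs, targets):
    y, context = srn_step(x, context)
    errors.append(np.mean((target - y) ** 2))   # per-step prediction error

# In the paper the weights are adjusted by backpropagation to reduce this error;
# the per-step error profile is what Figures 3 through 6 plot.
```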
From Simple Rhythms to Hidden Rules
Elman tested his network on a series of tasks, each more complex than the last, to see if this simple memory mechanism could uncover increasingly abstract structures.
Experiment 1: The Temporal XOR
The Exclusive-OR (XOR) problem is a classic benchmark. The task: given two bits, output 1 if they’re different, 0 if they’re the same.
Elman created a temporal version, producing a continuous stream of bits where every third bit was the XOR of the previous two. For example:
1, 0 → 1
0, 0 → 0
1, 1 → 0
This formed a stream: 1, 0, 1, 0, 0, 0, 1, 1, 0, ... The network saw one bit at a time and had to predict the next.
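A short sketch of how such a stream can be generated (the stream length is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def xor_stream(n_triples=1000):
    """Concatenate triples (a, b, a XOR b) into one continuous bit stream."""
    bits = []
    for _ in range(n_triples):
        a, b = rng.integers(0, 2, size=2)
        bits.extend([int(a), int(b), int(a ^ b)])
    return np.array(bits)

stream = xor_stream()
# The first two bits of each triple are random; only every third bit is
# predictable from the two before it.
```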
Figure 3: Average prediction error for the temporal XOR task. The error drops at the moments where the next bit is predictable (every third bit), showing the network has learned the dependency.
The repeating pattern of high-high-low error reveals that the network learned the XOR rule: unpredictability for the first two bits, predictability when the third (XOR) bit arrives. Importantly, the SRN’s memory—via context units—was enough to solve a problem requiring a two-step look-back.
Experiment 2: Finding Structure in Artificial Words
Next, Elman created a mini-language of consonants (b, d, g) and vowels (a, i, u). Each “letter” was represented by a 6-bit feature vector.
Table 1: The artificial alphabet, with each letter defined by six phonetic-style binary features.
He defined simple rules:
b → ba
d → dii
g → guuu
A random sequence of consonants was expanded according to these rules, producing streams like diibaguuubadi...
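A sketch of the stream generator (letters stand in for the 6-bit feature vectors of Table 1, which would be looked up before feeding the network):

```python
import random

random.seed(0)
EXPANSIONS = {"b": "ba", "d": "dii", "g": "guuu"}   # the rewrite rules above

def letter_stream(n_consonants=1000):
    """Expand a random consonant sequence into a letter stream such as 'diibaguuuba...'."""
    consonants = random.choices(list(EXPANSIONS), k=n_consonants)
    return "".join(EXPANSIONS[c] for c in consonants)

stream = letter_stream()
```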
The SRN read the stream one letter at a time and predicted the next.
Figure 4: Prediction error over time. Spikes occur on unpredictable consonants; valleys occur during predictable vowel sequences.
The pattern makes sense: consonants occur randomly, vowels follow predictably. But the deep insight came from feature-wise error.
Figure 5(a): Error for the Consonantal feature (bit 1) stays low—after vowels, the network can reliably predict that a consonant is next.
Figure 5(b): Error for the High feature (bit 4) spikes when predicting consonants, since consonants differ in this trait.
This shows the SRN wasn’t just memorizing sequences—it learned abstract rules about features, making partial predictions based on structural regularities.
Experiment 3: Discovering “Words” from Continuous Text
In speech, words aren’t separated by spaces. To simulate this, Elman concatenated simple sentences into one long stream of letters, with no spaces or punctuation. The network’s task was to predict the next letter.
Figure 6: Prediction error in the letter-in-word task. Peaks often correspond to word boundaries, where predictability is lowest.
Error peaks aligned with word boundaries: unpredictability is highest at the start of a word, and predictability increases as the word unfolds. This suggests the error signal could serve as a cue for learning segmentation—finding “word” boundaries without ever being told what a word is.
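One way to operationalize that cue, as a rough sketch: mark a candidate boundary wherever the per-letter prediction error jumps relative to the letters around it (the error values would come from a trained network, as in Figure 6; the thresholding rule here is an assumption, not something from the paper):

```python
import numpy as np

def boundary_candidates(errors, threshold=None):
    """Flag positions where prediction error rises sharply: likely word onsets."""
    errors = np.asarray(errors, dtype=float)
    rises = np.diff(errors, prepend=errors[0])     # change in error at each letter
    if threshold is None:
        threshold = rises.mean() + rises.std()     # crude adaptive cutoff
    return np.where(rises > threshold)[0]

# boundaries = boundary_candidates(per_letter_errors)   # per_letter_errors from the trained SRN
```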
The Main Event: Grammar from Word Order
Finally, Elman tackled whole words. He created a simple grammar to generate thousands of short sentences from a 29-word vocabulary.
Table 3: Lexical categories used in the grammar. The network never saw these labels—it had to infer them.
Table 4: Sentence templates defining permissible word orders.
Each word was assigned a randomly chosen one-hot vector, with no built-in similarity: man and woman were, in their raw representations, as different from each other as rock and eat. Any grouping had to come from how the words appeared in sentences.
Trained to predict the next word in the concatenated sentence stream, the network developed internal hidden-unit representations for each word. Elman averaged these across contexts (to get type-level representations) and performed hierarchical clustering.
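A sketch of that analysis step, assuming we have gathered the hidden-layer vector produced at each word occurrence into a list of (word, vector) pairs (the variable names are assumptions; SciPy's standard hierarchical-clustering routines stand in for whatever clustering tool was used):

```python
import numpy as np
from collections import defaultdict
from scipy.cluster.hierarchy import linkage, dendrogram

def type_representations(hidden_states):
    """Average each word's hidden vectors across every context it appeared in."""
    buckets = defaultdict(list)
    for word, vec in hidden_states:          # hidden_states: list of (word, hidden_vector)
        buckets[word].append(vec)
    words = sorted(buckets)
    return words, np.array([np.mean(buckets[w], axis=0) for w in words])

# words, reps = type_representations(hidden_states)
# tree = linkage(reps, method="average")     # hierarchical clustering, as in Figure 7
# dendrogram(tree, labels=words)
```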
Figure 7: The network’s conceptual map. Words are grouped by grammatical and semantic role—discovered purely from prediction learning.
The tree’s structure was strikingly linguistic:
- First split: Nouns vs Verbs.
- Nouns split into Animates vs Inanimates; animates split into Humans vs Animals.
- Verbs split by syntactic behavior: transitive, intransitive, optional objects.
The SRN had inferred the correlates of grammar from distributional statistics alone.
Types, Tokens, and Context
Figure 7 showed types: each category was the average of many tokens seen in different contexts. But SRN representations are always entwined with their preceding context. This means boy in “boy chases…” differs from boy in “…chases boy.”
Figure 9: Individual tokens of BOY and GIRL cluster based on sentence position and preceding context.
Tokens cluster by similarity of situation—e.g., sentence-initial vs sentence-final—reflecting how context warps a word’s representation.
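The token-level analysis is the same pipeline minus the averaging: keep one vector per occurrence and cluster those directly (continuing the assumed `hidden_states` list from the sketch above):

```python
import numpy as np

def token_representations(hidden_states, word):
    """Collect every hidden vector produced for one word, one entry per occurrence."""
    return np.array([vec for w, vec in hidden_states if w == word])

# boy_tokens = token_representations(hidden_states, "boy")
# Clustering these token vectors groups them by context, e.g. sentence-initial
# versus sentence-final uses, as in Figure 9.
```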
To emphasize this, Elman tried a novel word: zog. He inserted zog in all positions where man could appear, and passed these sentences to the trained network (no retraining).
Figure 8: The novel word ZOG placed with the human nouns, next to MAN—based only on contextual usage.
The network classified zog as a human noun, consistent with its distribution in sentences—mirroring how humans infer meaning for new words from context.
Conclusion: The Legacy of a Simple Idea
Finding Structure in Time showed that:
- Time can be represented implicitly in a network’s internal state, without spatial sequencing hacks.
- Prediction is a potent training signal, forcing a network to capture temporal structure.
- Grammatical categories emerge from distributional patterns, without explicit rules.
- Representations can be both categorical and context-sensitive, naturally distinguishing types and tokens.
The SRN was the ancestor of modern recurrent architectures like LSTMs and GRUs. Today’s Transformers solve sequence problems differently, but still wrestle with the same core challenge: integrating context into each element’s representation.
Elman’s simple tweak gave neural networks an inner voice that could remember, anticipate, and abstract. It was a step toward teaching machines not just to see, but to understand the flow of events—the very essence of experience.