Introduction: The “Empty Cup” Problem

Isaac Newton once famously remarked, “If I have seen further, it is by standing on the shoulders of giants.” He wasn’t literally standing on people; he was using an analogy to describe how scientific progress is built upon previous discoveries.

Analogical reasoning—the ability to recognize that Situation A is like Situation B because of a shared underlying structure—is a cornerstone of human cognition. It allows us to learn new concepts, solve problems creatively, and communicate complex ideas. We do this effortlessly. If I tell you a story about a tree that collapsed because its trunk was rotten, and then a story about a person who collapsed from burnout because they didn’t care for themselves, you instantly recognize the connection: internal neglect leading to external failure.

But can Large Language Models (LLMs) do this?

We know LLMs are good at surface-level patterns and simple word analogies (like “King is to Man as Queen is to Woman”). However, a recent paper titled “ANALOBENCH: Benchmarking the Identification of Abstract and Long-context Analogies” poses a tougher question: Can AI identify analogies when they are hidden within long, complex narratives?

The researchers introduce ANALOBENCH, a new benchmark designed to test if models can recall relevant experiences from large pools of information and apply reasoning to lengthy scenarios. The results are surprising: while models are getting bigger, their ability to “read between the lines” in long texts isn’t scaling up as we might hope.

Figure 1: The problem setup: given a story, the goal is to identify an analogous story from a story bank. In the example, both “Maria” and “the oak” lose the ability to provide for others.

As shown in Figure 1, the core of the problem is recognizing that Maria’s exhaustion (“can’t pour from an empty cup”) is analogous to the fallen oak tree (“cannot provide shade”).

Background: Beyond Simple Word Games

To understand why this research matters, we have to look at how AI has historically handled analogies.

For a long time, benchmarks focused on lexical analogies. These are the SAT-style questions involving single words. While useful for testing early word embeddings (like Word2Vec), they don’t capture the depth of human reasoning. Humans don’t just match words; we map structures.
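
For contrast, these word-level analogies are typically solved with simple vector arithmetic over embeddings, with no structural mapping required. Below is a toy sketch of that “vector offset” trick; the four-dimensional vectors are made up for illustration and are not real Word2Vec embeddings.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for real word embeddings
# (illustrative values only; real Word2Vec vectors have hundreds of dimensions).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.3]),
    "woman": np.array([0.7, 0.1, 0.9, 0.3]),
    "queen": np.array([0.9, 0.8, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.3, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic vector-offset trick: king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(
    (word for word in vectors if word not in {"king", "man", "woman"}),
    key=lambda word: cosine(target, vectors[word]),
)
print(best)  # -> "queen" with these toy vectors
```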

According to Structure Mapping Theory (a concept from cognitive science), an analogy relies on shared relational patterns, not shared attributes. For example, an atom is like a solar system not because electrons look like planets, but because the relationship of electrons orbiting a central nucleus mirrors that of planets orbiting a star.

The Missing Piece: Length and Abstraction

Previous benchmarks have tried to test this using proverbs or metaphors. However, modern LLMs have likely seen every common proverb in their training data, making it a test of memory rather than reasoning. Furthermore, real-world analogies often involve recalling memories from days, weeks, or years ago—complex “stories” stored in our long-term memory.

The authors of ANALOBENCH identified two specific human capabilities that were missing from AI evaluation:

  1. Long-context reasoning: The ability to pinpoint analogies between prolonged experiences (e.g., “Writing a thesis is like running a marathon”).
  2. Retrieval from memory: The ability to pick the right analogy out of a massive “haystack” of irrelevant memories.

Core Method: Building ANALOBENCH

The researchers constructed a dataset that prioritizes quality and complexity over sheer volume. Unlike many datasets scraped from the web, ANALOBENCH was handcrafted.

1. Creating the Seed Stories

The team started by writing 340 high-quality, human-authored analogies. They utilized a “cluster” approach. If Story A is analogous to Story B, and Story B is analogous to Story C, then A and C are also analogous. This transitivity allowed them to build robust clusters of related narratives.
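
The paper doesn’t publish its clustering code, but the transitivity step is easy to picture. Here is a minimal union-find sketch (my own illustration, not the authors’ implementation) of how annotated pairs collapse into clusters:

```python
# Minimal sketch (not the authors' code): grouping annotated analogous
# story pairs into clusters via transitivity, using a simple union-find.
from collections import defaultdict

def build_clusters(analogous_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in analogous_pairs:
        union(a, b)

    clusters = defaultdict(set)
    for story in parent:
        clusters[find(story)].add(story)
    return list(clusters.values())

# If A~B and B~C, then A, B, and C end up in the same cluster.
print(build_clusters([("A", "B"), ("B", "C"), ("D", "E")]))
# -> [{'A', 'B', 'C'}, {'D', 'E'}]  (set ordering may vary)
```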

Crucially, the annotators were instructed to avoid surface-level similarities. If one story was about a “storm,” the analogous story shouldn’t mention “rain” or “wind” unless necessary. This forces the model to look at the plot and relationships rather than rely on keyword matching.

Figure 3: An overview of dataset creation. Left: Human annotators create pairs. Right: Pairs are grouped into clusters.

2. Expanding the Context

To test the “Long-Context” aspect, the researchers didn’t stop at single sentences. They used GPT-4 to expand these seed analogies into longer stories. This resulted in three versions of every analogy:

  • 1-Sentence: The core concept (e.g., “All that glitters is not gold”).
  • 10-Sentences: A short paragraph elaborating on the concept.
  • 30-Sentences: A detailed narrative with “noise”—extra details that preserve the analogy but make it harder to spot.

This expansion mimics real life. When you compare a current situation to a past memory, you have to filter out the irrelevant details (what you were wearing, the weather) to find the structural match.
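
The paper’s exact expansion prompts aren’t reproduced here, but the general shape of this step is simple to sketch. The snippet below uses the OpenAI chat API with illustrative prompt wording of my own, not the authors’ prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_story(seed: str, n_sentences: int) -> str:
    """Ask the model to elaborate a one-sentence analogy into a longer story."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Expand the following sentence into a story of about "
                f"{n_sentences} sentences. Keep the underlying meaning the same, "
                f"but add concrete narrative detail:\n\n{seed}"
            ),
        }],
    )
    return response.choices[0].message.content

ten_sentence_version = expand_story("All that glitters is not gold.", 10)
```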

Figure 2: Overview of ANALOBENCH. The benchmark features two tasks: Identify analogies from a mini story bank and from a large story bank.

3. The Two Tasks

The paper evaluates models on two distinct tasks, as illustrated in Figure 2:

Task 1 (\(T_1\)): The Mini Story Bank

Here, the model is given a target story and four options (one correct analogy, three distractors). This is a standard multiple-choice format.

  • Goal: Test pure reasoning ability. Can the model distinguish the correct analogy when the search space is small?

Figure 6: Analogy Selection Prompt for Different Models.

Figure 6 shows how this looks in practice. The model sees the target and options A, B, C, D. It must select the best match.
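
Figure 6 contains the authors’ actual prompt; the sketch below only approximates that format and adds a scoring helper for accuracy on \(T_1\). The wording and function names are illustrative, not the paper’s:

```python
import string

def format_t1_prompt(target: str, options: list[str]) -> str:
    """Build a multiple-choice prompt: one target story, four candidate stories."""
    lines = [
        "Which of the following stories is most analogous to the target story?",
        f"\nTarget story: {target}\n",
    ]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("\nAnswer with a single letter.")
    return "\n".join(lines)

def t1_accuracy(predicted_letters: list[str], gold_letters: list[str]) -> float:
    """Fraction of questions where the model's letter matches the gold letter."""
    correct = sum(
        p.strip().upper().startswith(g)
        for p, g in zip(predicted_letters, gold_letters)
    )
    return correct / len(gold_letters)
```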

Task 2 (\(T_2\)): The Large Story Bank

This is the “Needle in a Haystack” test. The model is given a target story and a bank of 200 stories. It must retrieve the top 10 most analogous stories.

  • Goal: Test retrieval and long-context memory. This simulates a human recalling a relevant past experience from their life’s history.
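
For intuition, here is what a naive embedding-similarity baseline for \(T_2\) might look like. This is my own sketch (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model), not the paper’s setup; in the benchmark the language model itself does the retrieval, and this kind of surface-similarity matching is exactly what the dataset is designed to defeat:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_top_k(target: str, story_bank: list[str], k: int = 10) -> list[int]:
    """Return the indices of the k bank stories most similar to the target."""
    target_emb = model.encode(target, convert_to_tensor=True)
    bank_emb = model.encode(story_bank, convert_to_tensor=True)
    scores = util.cos_sim(target_emb, bank_emb)[0]  # one score per bank story
    return scores.topk(k).indices.tolist()
```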

Experiments & Results: The “Scale” Trap

The researchers tested a wide range of models, including open-source options like LLaMA-2 and proprietary giants like GPT-4 and Claude-v2. The results revealed some fascinating limitations of current AI.

Result 1: Length Breaks the Models (\(T_1\))

In the multiple-choice task, performance was decent for short (1-sentence) analogies. GPT-4 achieved nearly 90% accuracy. However, as soon as the stories got longer, performance plummeted.

Figure 4: Accuracy of LMs on T1. Left: Scaling works for short stories. Right: Accuracy drops as story length increases.

Look at the right side of Figure 4. The black dashed line represents human performance. Notice how stable it is? Humans are roughly as good at identifying analogies in 30-sentence stories as they are in 1-sentence stories. In fact, human annotators reported that the longer stories were actually easier because the extra detail helped disambiguate the meaning.

Now look at the colored lines for the AI models. They all slope downward. The more context provided, the worse the models performed.

Even more concerning is the “Scaling” chart on the left of Figure 4. For short stories, making the model bigger (more parameters) leads to better accuracy. But for longer stories, the scaling curves flatten out. Simply making the model bigger does not solve the problem of understanding complex, long-form analogies.

Result 2: Humans are Still Superior

The gap between human and machine is stark.

Table 2: Benchmarking various models. Humans outperform even GPT-4, especially on longer texts.

Table 2 highlights this gap. On 30-sentence stories, humans achieved 73.3% accuracy. The best model, GPT-4, managed 60.7%, while many open-source models dropped to near-random guessing (~25%). This suggests that while models are “reading” the text, they aren’t forming the abstract mental models necessary to connect two disparate long narratives.

Result 3: The Retrieval Nightmare (\(T_2\))

If the multiple-choice task was hard, the retrieval task (finding the analogy in a bank of 200 stories) was nearly impossible for the models.

Figure 5: Precision-recall plot of LMs on T2. With increasing story length, performance approaches random.

Figure 5 displays Precision-Recall curves. In an ideal world, these lines would be in the top-right corner.

  • Left (1-Sentence): GPT-4 (blue line) does okay. It can find short, punchy analogies.
  • Right (30-Sentences): The lines collapse to the bottom.

For the 30-sentence stories, the models’ ability to retrieve the correct analogy was barely better than picking stories at random. This indicates that while long context windows let models “read” large amounts of text, the models struggle to retain the structural meaning required for analogical retrieval.
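
To make the precision-recall numbers concrete: for a cutoff \(k\), precision is the fraction of the \(k\) retrieved stories that are truly analogous, and recall is the fraction of all truly analogous stories that appear in the top \(k\). A minimal sketch with made-up story IDs:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """Precision and recall when only the top-k retrieved stories are kept."""
    top_k = retrieved[:k]
    hits = sum(story in relevant for story in top_k)
    return hits / k, hits / len(relevant)

# Example: 3 truly analogous stories exist; only 1 shows up in the top 5.
precision, recall = precision_recall_at_k(
    retrieved=["s7", "s2", "s9", "s4", "s1"],
    relevant={"s2", "s5", "s8"},
    k=5,
)
print(precision, recall)  # 0.2 0.333...
```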

Why Does This Matter?

You might wonder, “So what if an AI can’t match stories?”

The implications are significant for how we use AI in the real world.

  1. Legal Tech: A lawyer might use AI to find a “precedent” case. This is an analogical task. “Find me a case where a company was liable for negligence due to a third-party vendor, similar to my current client’s situation.” If the AI can’t handle long contexts, it might miss the perfect precedent because it got distracted by surface details.
  2. Scientific Innovation: Innovation often comes from cross-domain analogies (e.g., the structure of the atom resembling the solar system). An AI that can’t abstract structure from long scientific papers will struggle to be a true “co-scientist.”
  3. General Reliability: The failure of “scaling” to fix this problem (as seen in the experiments) suggests that we might need new architectures or training methods, not just bigger GPUs, to achieve human-like reasoning.

Conclusion

The ANALOBENCH paper provides a reality check for the capabilities of Large Language Models. It shows that while AI has made tremendous strides, it still lacks the “core of cognition”—the ability to effortlessly map abstract relationships across long, complex experiences.

The researchers demonstrated that:

  1. Context is a double-edged sword: While extra detail helps humans understand analogies better, it acts as “noise” that confuses LLMs.
  2. Scale isn’t a magic fix: Simply making models larger offered minimal gains for long-context analogical reasoning.
  3. The Human Advantage: Humans remain the undisputed champions of abstract pattern matching, easily recalling “needles” from the “haystacks” of our memories.

For students of AI, this paper highlights a fertile ground for future research. How do we teach models to ignore the fluff and see the structure? Until we solve that, AI will remain a system that can read a library of books but might miss the moral of the story.