If you have a smartphone and a pulse, you are likely familiar with the morning ritual of millions: the New York Times Games app. While Wordle tests your vocabulary and Sudoku tests your logic, there is one game that consistently causes frustration, delight, and intense debate in group chats everywhere: Connections.

The premise is deceptively simple. You are given 16 words. Your job is to sort them into four categories of four words each. But as any player knows, the game is a minefield of “red herrings,” obscure trivia, and lateral thinking puzzles that require you to look at a word not just for what it means, but for how it is spelled, what it sounds like, or what word might come after it in a phrase.

In the world of Artificial Intelligence, we have seen Large Language Models (LLMs) like GPT-4 and Claude crush standardized tests, bar exams, and coding challenges. But do these models actually possess abstract reasoning? Can they “think outside the box” well enough to solve a puzzle designed specifically to mislead you?

A fascinating research paper titled “Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game” by researchers from Barnard College, Columbia University, and Stony Brook University tackles this exact question. They used the Connections game as a rigorous benchmark to test the fluid intelligence of state-of-the-art AI.

In this deep dive, we will explore their methodology, the complex taxonomy of knowledge required to solve these puzzles, and the surprising results of Man vs. Machine.

The Problem: Measuring Fluid Intelligence

Why does this matter? We know LLMs are good at crystallized intelligence—retrieving facts they were trained on. If you ask an AI “What is the capital of France?”, it simply retrieves that association.

However, abstract reasoning (often aligned with fluid intelligence) involves solving novel problems, identifying patterns in noisy data, and working with logical systems where the rules aren’t explicitly stated. This is the frontier of AI research. Previous benchmarks often focused on arithmetic or commonsense reasoning, but it was unclear if LLMs could handle the high-level association tasks required by a game like Connections.

The Test Bed: How Connections Works

The NYT Connections game, launched in June 2023, presents a 4x4 grid of 16 words under a single instruction: “Create four groups of four!”

The player must select four distinct clusters. The categories are color-coded by difficulty:

  1. Yellow (Straightforward): The most intuitive grouping (e.g., “Types of Fruit”).
  2. Green: Slightly harder, often nuanced semantic relations.
  3. Blue: Requires encyclopedic knowledge or specific associations.
  4. Purple (Tricky): The hardest level. These often involve wordplay, fill-in-the-blanks (e.g., “Words after ‘Hot’”), or phonological tricks.

The challenge lies in the red herrings. The game acts as an “adversarial environment.” A word like “Rose” might fit into a category of flowers, but it could also belong to “Past Tense Verbs” (rise/rose) or “Women’s Names.” To solve the puzzle, the player must hold multiple hypotheses in their head simultaneously and deduce the only configuration where all 16 words fit perfectly into four distinct groups.

The Methodology: A Taxonomy of Knowledge

To scientifically evaluate how LLMs reason, the researchers didn’t just count wins and losses. They developed a comprehensive taxonomy of knowledge. They analyzed 438 games (spanning June 2023 to August 2024) and categorized every single solution group into specific types of reasoning required to solve it.

This taxonomy is crucial for understanding why AI models fail. As shown in the diagram below, knowledge is split into three primary branches:

Proposed taxonomy of knowledge types.

1. Word Form

This branch tests knowledge of the word as a symbol, rather than just its definition.

  • Phonology: How the word sounds (e.g., homophones).
  • Orthography: How the word is spelled (e.g., anagrams, palindromes).
  • Morphology: Word structure (e.g., words ending in specific suffixes like “-ship” or “-ness”).
  • Multiword Expressions: Words that are part of a fixed phrase (e.g., “Words after PAY” -> Check, Dirt, Pal, Phone). This is often where models struggle because the word “Dirt” has nothing to do with “Phone” semantically; they only relate through the hidden word “Pay.”

2. Word Meaning

This is the domain where we expect LLMs to excel.

  • Semantic Relations: Synonyms, hypernyms (is-a relationships), and polysemy (multiple meanings).
  • Associative Relations: Words that don’t share a definition but share a context. For example, “Things that are Red” (Mars, Rose, Strawberry, Devil). These items are distinct semantically but linked by a thematic attribute.
  • Encyclopedic: Facts about the world. “Newspaper Names” (Globe, Mirror, Post, Sun). You need to know that “The Sun” is a tabloid, not just a star.

3. Word Form + Word Meaning

The most complex category. It combines structural knowledge with semantic knowledge. An example provided in the paper is “Social Media App Endings” (Book, Gram, In, Tube). You must know the semantic entity (Facebook, Instagram) and perform the morphological operation of stripping the suffix.

The Distribution of Difficulty

The researchers annotated 1,752 categories across the dataset. As seen in the table below, the vast majority of the game relies on Semantic Relations (1045 instances), which helps explain why humans find the game approachable—we are wired for meaning. However, the presence of nearly 200 Multiword Expressions and Encyclopedic clues turns it into a logic puzzle.

Distribution of different knowledge types.

Experimental Setup: LLMs vs. Humans

The researchers tested five state-of-the-art Large Language Models:

  1. Gemini 1.5 Pro (Google)
  2. Claude 3.5 Sonnet (Anthropic)
  3. GPT-4o (OpenAI)
  4. Llama 3.1 405B (Meta)
  5. Mistral Large 2 (Mistral AI)

They used Few-Shot Prompting with Chain-of-Thought (CoT). This means they didn’t just ask the AI for the answer; they gave it examples of how to play and explicitly asked it to “explain your reasoning step-by-step” before giving the final groupings. This mimics how a human player thinks (“Okay, these four look like fish types, but wait, ‘Bass’ could also be an instrument…”).
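To make the setup concrete, here is a minimal sketch of what such a few-shot CoT prompt can look like. The example puzzle, instructions, and formatting below are illustrative assumptions, not the exact prompt used in the paper.

```python
# A minimal sketch of a few-shot chain-of-thought prompt for Connections.
# The worked example, wording, and formatting are illustrative assumptions,
# not the paper's actual prompt.

FEW_SHOT_EXAMPLE = """Words: BASS, FLOUNDER, SALMON, TROUT, CELLO, VIOLA, VIOLIN, HARP, ...
Reasoning: BASS could be a fish or an instrument. FLOUNDER, SALMON, and TROUT can
only be fish, and CELLO, VIOLA, VIOLIN, and HARP already form four string
instruments, so BASS must go with the fish.
Groups:
1. Fish: BASS, FLOUNDER, SALMON, TROUT
2. String instruments: CELLO, VIOLA, VIOLIN, HARP
"""

def build_prompt(words: list[str]) -> str:
    """Assemble a few-shot prompt that asks the model to reason before answering."""
    return (
        "You are playing NYT Connections. Sort the 16 words into four groups of four. "
        "Explain your reasoning step-by-step before giving the final groups.\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n"
        f"Now solve this puzzle.\nWords: {', '.join(words)}\nReasoning:"
    )
```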

Scoring Metrics

To evaluate performance, the study used two specific equations.

1. Unweighted Clustering Score: This is a raw count of how many groups (out of 4) the model correctly identified.

\[ \text{Unweighted Clustering Score} = \sum_{x=1}^{4} n_x \]

Here, \(n_x\) is 1 if the group is correct, and 0 if not. A perfect game is a 4.

2. Weighted Clustering Score: This score rewards the model for solving harder categories.

\[ \text{Weighted Clustering Score} = \sum_{x=1}^{4} w_{x-1}\, n_x \]

A Yellow category (easiest) is worth 1 point (\(w_0=1\)), while a Purple category (hardest) is worth 4 points (\(w_3=4\)). A perfect weighted score is 10.
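In code, both metrics reduce to a few lines. The sketch below assumes each game's results arrive as four booleans ordered from Yellow to Purple; the function name is mine, not the paper's.

```python
def clustering_scores(solved: list[bool]) -> tuple[int, int]:
    """Return the (unweighted, weighted) clustering scores for one game.

    `solved[i]` is True if the group at difficulty level i was identified
    correctly, ordered Yellow (i=0) through Purple (i=3).
    """
    weights = [1, 2, 3, 4]                       # w_0..w_3: Yellow=1 ... Purple=4
    unweighted = sum(1 for s in solved if s)     # perfect game = 4
    weighted = sum(w for w, s in zip(weights, solved) if s)  # perfect game = 10
    return unweighted, weighted

# Example: the model found Yellow and Blue but missed Green and Purple.
print(clustering_scores([True, False, True, False]))  # -> (2, 4)
```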

Results: How Did the LLMs Perform?

The results were sobering for AI enthusiasts. Despite the massive capabilities of these models, none could consistently “beat” the game.

The Leaderboard

Claude 3.5 Sonnet emerged as the clear winner, but even the “best” performance was far from perfect. Claude only fully solved 18% of the games (getting all 4 categories correct).

Let’s look at the frequency of unweighted scores:

Frequency of unweighted clustering scores.

Take a close look at the chart above (Figure 3 from the paper).

  • Zero Score: Mistral Large 2 failed to get a single correct group in 185 out of 438 games (42%).
  • Perfect Score (4): Claude 3.5 Sonnet achieved 79 perfect games. Llama 3.1 405B managed 47.
  • The “One Group” Trap: Across all models, a very common outcome was getting only one group correct (the bars for a score of 1 are quite high). This suggests models can spot the obvious “Yellow” category but crumble as the remaining words become more abstract and the red herrings tighten their grip.

The weighted scores further illustrate this disparity.

Spread of weighted clustering scores.

In the violin plot above, you can see the density of scores. Gemini 1.5 Pro and Mistral have heavy density near the bottom (scores 0-2). Claude and GPT-4o have a wider spread, reaching into the higher scores, indicating they are occasionally capable of solving the “Purple” and “Blue” categories, but they are inconsistent.

Human vs. Machine: The Reality Check

To contextualize these numbers, the researchers recruited human players. They split them into two groups: Novices (little to no experience) and Experts (regular players).

Novices vs. Claude 3.5 Sonnet

How does the best AI compare to a beginner human?

Frequency of clustering scores: Claude vs Novices.

In a sample of 100 games:

  • Novices (Orange bars) actually struggled quite a bit, with many scoring 0 or 1.
  • Claude (Blue bars) performed slightly better than the average novice in the middle range (scoring 1 or 2).
  • Note: The chart shows no Novices with a score of 3. In Connections, if you get 3 groups right, the 4th is automatically right (because those are the only words left), so a human fundamentally cannot score a “3.” The AI, however, sometimes hallucinates or repeats words, leading to invalid submissions that might technically contain 3 correct clusters but fail the game rules.

Experts vs. Claude 3.5 Sonnet

This is where the gap becomes undeniable.

Frequency of clustering scores: Claude vs Experts.

In a sample of 50 games:

  • Experts (Brown bars) completely dominated. They achieved a perfect score (4) in 32 out of 50 games (64%).
  • Claude (Blue bars) only solved 10 games perfectly (20%).

This result is critical. It proves that the NYT Connections game is not “unsolvable” or “random.” It is a solvable logic puzzle for a proficient human mind. The fact that even the strongest AI model tops out at a 20% solve rate highlights a significant deficiency in their abstract reasoning capabilities.

The weighted score distributions reinforce this. Look at the expert human distribution (orange) in the bottom graph below versus Claude (blue):

Spread of weighted clustering score comparison.

The Expert Human distribution is top-heavy (clustered around scores of 8-10), whereas Claude is distributed lower.

Why Do LLMs Fail? A Forensic Analysis

The paper’s most interesting contribution is the analysis of why the failure happens. By leveraging their taxonomy, the researchers pinpointed the specific cognitive weak points of LLMs.

1. The “Tokenization” Blind Spot (Word Form)

LLMs process text as “tokens” (chunks of characters), not as visual words or phonetic sounds.

  • Morphology/Orthography/Phonology: LLMs performed terribly here. If the category is “Words starting with ‘S’”, an LLM might miss it because the words’ semantic meanings distract it from their spelling.
  • Multiword Expressions: This was a major failure point. Example: words following “FIRE” -> Ant, Drill, Island, Opal. To an LLM, “Opal” is a gemstone. “Fire Opal” is a concept, but linking “Opal” to “Ant” solely through the hidden word “Fire” requires a multi-step leap that current architectures struggle with.
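You can see part of the problem by inspecting how a subword tokenizer actually splits these words. The snippet below uses the open-source tiktoken library purely as an illustration; the paper does not analyze tokenizations, and the choice of encoding is an assumption.

```python
# Illustration only: subword tokenization hides spelling and sound.
# Requires the open-source `tiktoken` library (pip install tiktoken);
# the cl100k_base encoding is an assumed choice for demonstration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["OPAL", "MISTLETOE", "FIREFIGHTER"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word} -> {pieces}")

# Each word arrives as one or more opaque chunks of characters, so properties
# like "starts with S" or "rhymes with coal" are never directly visible to the
# model the way they are to a human scanning the grid.
```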

2. The Semantic Crutch

LLMs are built on massive amounts of text data, creating strong vector embeddings for semantic relationships.

  • Semantic Relations: This was the AI’s strongest suit. If the category is “Types of Fish,” the vector similarity between “Bass,” “Salmon,” and “Trout” is high.
  • Encyclopedic Knowledge: Models also performed decently here (e.g., knowing specific members of a band), provided the information was in their training data.

The chart below visualizes this discrepancy perfectly.

Performance by Reasoning Type.

Look at the Semantic Relations and Encyclopedic bars—they are tall. Now look at Multiword Expressions and Word Meaning + Word Form. The accuracy drops precipitously. Claude 3.5 Sonnet (Teal bar) is the only one putting up a fight in the “Multiword” category, likely due to better training on idioms or phrasal verbs, but it still fails most of the time.

3. The Red Herring Problem

Connections is designed to trick you. It uses “distractors.”

  • Red Herring Category: Three words fit a category perfectly, but the fourth is missing.
  • Red Herring Word: A category exists, but 5 or 6 words on the board could fit into it (you only need 4).

Example 1: The Red Herring Category

In the image below, the words SKIM, WHOLE, and SOY are highlighted. A human (or AI) might immediately think “Milk Types.” But there is no fourth milk type on the board. The AI will often hallucinate a fourth connection or force a weak link just to satisfy the “Milk” hypothesis. In reality, these words belonged to three totally different groups (Whole -> Numbers, Skim -> Touch lightly, Soy -> Sauces).

Red Herring Category Example

Example 2: The Red Herring Word

In the example below, look at the Christmas-themed words: STOCKING, CANDY CANE, REINDEER, MISTLETOE. That looks like a group! But wait… there’s also SNOWMAN (not highlighted). There are five Christmas words. The puzzle requires you to realize that “Candy Cane” actually belongs to “Things with Stripes,” leaving the other four for the Christmas category. LLMs (except Claude) fell for this trap, grouping Candy Cane with Reindeer and failing to check if a better configuration existed.

Red Herring Word Example

4. Right Answer, Wrong Reason

Interestingly, the researchers found that sometimes LLMs got the group right but for the wrong reason.

  • Ratio of Reasoning to Clustering: The researchers calculated how often the models correctly explained their groups.
  • The categorical reasoning scores (Table 6 below) are consistently lower than the clustering scores (Table 5). This means there were cases where the AI guessed the group by luck or weak association but failed to identify the true unifying theme (e.g., the “Purple” wordplay category).

Categorical Reasoning Scores

Conclusion: The “System 2” Gap

The findings of this paper suggest that while LLMs are incredibly knowledgeable, they lack “System 2” thinking.

  • System 1 is fast, intuitive, and associative (like spotting that “Salmon” and “Trout” are fish). LLMs are great at this.
  • System 2 is slow, deliberative, and logical. It involves saying, “Okay, ‘Star’ fits here, but if I use ‘Star’ in this group, I can’t finish the Movie group. Therefore, I must backtrack and try a different combination.”

The NYT Connections game is a System 2 workout. It requires constraint satisfaction and global optimization—you can’t just pick the first four words that look related; you have to ensure the remaining 12 words also form valid groups.
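What does that look like mechanically? Below is a minimal backtracking sketch of the search the game demands: commit to a group, recurse on the 12 leftover words, and undo the choice when they cannot be completed. The is_valid_group predicate is hypothetical (a human's judgment, or an oracle); it is not something the paper implements.

```python
from itertools import combinations

def solve(words: list[str], is_valid_group) -> list[tuple[str, ...]] | None:
    """Backtracking search: partition the words into valid groups of four.

    `is_valid_group` is a hypothetical predicate (human judgment or an oracle)
    that says whether four words plausibly form a category.
    """
    if not words:
        return []                                    # every word placed: success
    anchor = words[0]                                # this word must land somewhere
    for rest in combinations(words[1:], 3):
        group = (anchor, *rest)
        if not is_valid_group(group):
            continue
        remaining = [w for w in words if w not in group]
        solution = solve(remaining, is_valid_group)  # can the other 12 still work?
        if solution is not None:
            return [group] + solution
        # If not, backtrack: abandon this group and try another one for `anchor`.
    return None
```

The key property is global consistency: a locally plausible group is only accepted if the leftover words can still be partitioned, which is exactly the check the LLMs skip when they fall for red herrings.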

What’s Next?

The paper concludes that to beat Connections, future AI systems might need:

  1. Iterative Reasoning: The ability to make a guess, check the remaining words, realize a conflict, and backtrack (similar to how AlphaGo plays).
  2. Retrieval Augmentation: Access to dictionaries or WordNet to check specific lexical properties (e.g., “does this word end in ‘ship’?”). A rough sketch of this kind of lookup appears after this list.
  3. Synthetic Training: Training on datasets specifically designed to teach lateral thinking and wordplay.
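As a rough idea of what the lexical lookup in item 2 could mean, here is a tiny sketch using NLTK's WordNet interface. The helper functions are hypothetical illustrations, not anything proposed or implemented in the paper.

```python
# Hypothetical helpers illustrating "retrieval" of lexical properties.
# Requires NLTK plus the WordNet corpus: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def ends_with(word: str, suffix: str) -> bool:
    """A purely orthographic check the model cannot reliably do from its tokens."""
    return word.lower().endswith(suffix)

def sense_count(word: str) -> int:
    """Number of WordNet senses, a cheap signal that a word may be a red herring."""
    return len(wn.synsets(word))

print(ends_with("FRIENDSHIP", "ship"))  # True
print(sense_count("bass"))              # several senses: the fish, the voice, the instrument...
```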

Until then, human experts can rest easy. You might not be able to write code as fast as GPT-4o, but when it comes to figuring out that “Sponge,” “Bob,” “Square,” and “Pants” belong together? You’re still the champion.


This blog post explains the research paper “Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game” by P. Samadarshi et al. (2025).