Vector embeddings have transformed information retrieval. From powering Google Search to surfacing the perfect product on Amazon, dense vector representations are now the backbone of modern search systems. We’ve asked them to tackle increasingly complex tasks—follow intricate instructions, perform multi-modal search, reason about code—and have assumed that with bigger models and better data, a single embedding vector could eventually represent any query and retrieve documents based on any notion of relevance.

But what if there’s a fundamental mathematical limit to what a single vector can represent? What if, no matter how large or well-trained our models become, there are certain combinations of relevant documents they simply cannot retrieve for a given query?

A recent paper from researchers at Google DeepMind and Johns Hopkins University, “On the Theoretical Limitations of Embedding-Based Retrieval”, explores exactly this question. It bridges the gap between the practical world of neural retrieval and the abstract realm of communication complexity theory. The authors prove that the dimensionality of an embedding vector imposes a hard cap on the possible top-k document sets it can represent—and they confirm this with empirical tests. To drive the point home, they build a dataset called LIMIT that exploits this constraint, breaking even state-of-the-art models.

In this article, we’ll unpack the theory, walk through the experiments, and discuss what these findings mean for the future of retrieval.


The Rise of the All-Purpose Embedding

For decades, information retrieval was dominated by sparse methods like BM25, which represent documents as high-dimensional, sparse vectors based on word counts. While effective, these methods struggle with semantic understanding—handling synonyms, paraphrasing, and conceptual queries.

Dense retrieval changed the game. Neural networks now map queries and documents into a shared, lower-dimensional space (e.g., 768 or 4096 dimensions) where semantic similarity aligns with geometric proximity. Ask “what to see in the eternal city” and your query vector should land near a vector for “landmarks in Rome.”
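
To make this concrete, here is a minimal sketch of dense retrieval using the sentence-transformers library. The model name is just an example, not one evaluated in the paper; any bi-encoder that maps text to fixed-size vectors works the same way.

```python
# Minimal sketch of dense retrieval: encode a query and documents into the
# same vector space, then rank documents by dot-product similarity.
# The model name is illustrative, not one used in the paper.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # d = 384 dimensions

query = "what to see in the eternal city"
docs = [
    "Landmarks in Rome: the Colosseum, the Pantheon, and the Trevi Fountain.",
    "The eternal flame is a symbol of remembrance.",
    "Best hiking trails in the Alps.",
]

q = model.encode(query, normalize_embeddings=True)   # shape (384,)
D = model.encode(docs, normalize_embeddings=True)    # shape (3, 384)

scores = D @ q                                       # cosine similarity
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {docs[idx][:50]}")
```

Retrieval reduces to a matrix-vector product over the corpus, which is exactly why the geometry of that vector space ends up mattering so much.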

Dense retrieval has proven immensely powerful. But new benchmarks have pushed these models far beyond keyword matching, into realms of instruction-following and reasoning. Tasks like:

  • Logical operations in queries: “Moths or Insects or Arthropods of Guadeloupe”
  • Reasoning-based relevance: “Find other Leetcode problems that use dynamic programming”

These tasks require the model to dynamically group documents that might previously have been unrelated—dramatically increasing the number of possible “relevant sets” to represent. The implicit challenge: handle every possible task.

The paper asks: Is that even possible?


The Theoretical Limit: Where Geometry Meets Retrieval

The core insight connects retrieval to a concept in learning theory: sign rank.

Formalizing Retrieval

Consider a retrieval task with m queries and n documents. We represent ground-truth relevance with a binary matrix:

\[ A \in \{0, 1\}^{m \times n} \]

where \(A_{ij} = 1\) if document j is relevant to query i. This is often called the qrels matrix.
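
As a toy illustration (the values are made up), the qrels matrix is just a small binary array:

```python
import numpy as np

# Toy qrels matrix A for m = 3 queries and n = 4 documents.
# A[i, j] = 1 means document j is relevant to query i (values are illustrative).
A = np.array([
    [1, 1, 0, 0],   # query 0: docs 0 and 1 are relevant
    [0, 1, 1, 0],   # query 1: docs 1 and 2 are relevant
    [1, 0, 0, 1],   # query 2: docs 0 and 3 are relevant
])
m, n = A.shape
```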

An embedding model generates:

  • Query vectors \(\mathbf{u}_i \in \mathbb{R}^d\)
  • Document vectors \(\mathbf{v}_j \in \mathbb{R}^d\)

Scores are typically dot products: \(s_{ij} = \mathbf{u}_i^T \mathbf{v}_j\). For each query, all relevant docs should score higher than irrelevant ones.

Stacking all query and document vectors into matrices \(U\) and \(V\), the score matrix is:

\[ B = U^T V \]

The rank of \(B\) is at most \(d\), the embedding dimension. The question becomes:

What is the smallest \(d\) such that there exists a rank-\(d\) score matrix \(B\) that correctly orders documents for every query?

The authors call this the row-wise order-preserving rank of \(A\), denoted \(\operatorname{rank}_{\mathrm{rop}} A\).
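
A small sketch of that definition: given candidate embeddings \(U\) and \(V\), the check below verifies whether \(B = U^T V\) orders documents correctly for every row of \(A\). The function name is my own shorthand, not the paper's notation.

```python
import numpy as np

def is_order_preserving(A: np.ndarray, U: np.ndarray, V: np.ndarray) -> bool:
    """Check whether B = U^T V orders documents correctly for every query.

    A: (m, n) binary relevance matrix.
    U: (d, m) query embeddings, one column per query.
    V: (d, n) document embeddings, one column per document.
    """
    B = U.T @ V  # (m, n) score matrix; rank(B) <= d by construction
    for i in range(A.shape[0]):
        relevant = B[i, A[i] == 1]
        irrelevant = B[i, A[i] == 0]
        # Every relevant document must strictly outscore every irrelevant one.
        if relevant.size and irrelevant.size and relevant.min() <= irrelevant.max():
            return False
    return True

# rank_rop(A) is the smallest d for which some choice of U, V makes this True.
```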

Enter Sign Rank

The sign rank of a matrix \(M \in \{-1, +1\}^{m \times n}\) is the smallest rank of a real matrix \(B\) whose entries have the same signs as \(M\).

Transforming the binary relevance matrix \(A\) into \((2A - \mathbf{1}_{m \times n})\) yields a +1 for relevant docs and -1 for irrelevant docs.
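
Continuing the toy qrels matrix from above, the transform is a one-liner:

```python
import numpy as np

A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])   # toy qrels matrix from above

S = 2 * A - 1                  # entries: +1 for relevant, -1 for irrelevant
# The sign rank of S pins down (to within 1) the embedding dimension needed
# to solve this retrieval task exactly.
```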

The key theoretical result (Proposition 2) is:

\[ \operatorname{rank}_{\pm}(2A - \mathbf{1}) - 1 \le \operatorname{rank}_{\mathrm{rop}} A \le \operatorname{rank}_{\pm}(2A - \mathbf{1}) \]

In plain terms:

  • The minimum embedding dimension needed to perfectly solve a retrieval task is almost exactly equal to the sign rank of its transformed relevance matrix.

Implications

  1. Hard tasks are inevitable. Matrices exist with arbitrarily high sign rank. For any fixed \(d\), there will be retrieval tasks that are impossible to solve perfectly.
  2. Task difficulty is quantifiable. The representational complexity of a retrieval task is captured by the sign rank of its qrel matrix.

Testing the Theory: Best-Case Optimization

The authors next ask: does this limitation appear in practice?

Free Embedding Optimization

They sidestep natural language entirely. In the free embedding experiment, queries and docs are represented as directly trainable vectors, optimized on the test set with InfoNCE loss. No linguistic constraints, no generalization required.

The goal: find the critical-n for fixed \(d\) and top-k = 2, i.e., the largest number of docs \(n\) for which all \(\binom{n}{k}\) possible relevant sets can still be represented.
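
As a rough sketch of this setup (my own simplification in PyTorch, not the authors' code), the snippet below directly optimizes free query and document vectors with an InfoNCE-style loss and then checks how many queries retrieve both of their relevant documents in the top 2. In the actual experiment, \(n\) is increased until this success rate can no longer reach 100%; that breaking point is the critical-n for the given \(d\).

```python
import itertools
import torch
import torch.nn.functional as F

# Free-embedding sketch: n documents, one query per pair of documents (k = 2).
# The vectors themselves are the trainable parameters; no encoder, no language.
d, n = 16, 12                                    # embedding dim, number of docs
pairs = list(itertools.combinations(range(n), 2))
m = len(pairs)                                   # one query per doc pair

U = torch.nn.Parameter(torch.randn(m, d) * 0.1)  # query embeddings
V = torch.nn.Parameter(torch.randn(n, d) * 0.1)  # document embeddings
opt = torch.optim.Adam([U, V], lr=0.01)

rel = torch.zeros(m, n)                          # qrels matrix
for qi, (a, b) in enumerate(pairs):
    rel[qi, a] = rel[qi, b] = 1.0

for step in range(2000):
    scores = U @ V.T                             # (m, n) score matrix
    # InfoNCE-style objective: each relevant doc should beat all other docs.
    log_probs = F.log_softmax(scores, dim=1)
    loss = -(log_probs * rel).sum() / rel.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Success check: for every query, are both relevant docs in the top 2?
with torch.no_grad():
    top2 = (U @ V.T).topk(2, dim=1).indices
    hits = sum(set(top2[qi].tolist()) == set(p) for qi, p in enumerate(pairs))
    print(f"queries solved: {hits}/{m}")
```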

Figure 2 | Critical-n grows polynomially with embedding dimension \(d\).

The results show a clean cubic relationship between \(d\) and the maximum representable \(n\). Even in this idealized setting, the limits fall well short of real-world corpus sizes: for \(d=1024\), critical-n is roughly 4M docs; for \(d=4096\), roughly 250M. Web-scale corpora (billions of docs) exceed these limits.
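
A quick sanity check of the cubic trend, using only the two critical-n values quoted above:

```python
# If critical-n ~ c * d^3, then going from d = 1024 to d = 4096 (a 4x increase)
# should multiply critical-n by roughly 4^3 = 64.
ratio_d = 4096 / 1024            # 4.0
ratio_n = 250e6 / 4e6            # 62.5
print(ratio_d ** 3, ratio_n)     # 64.0 vs 62.5 -- consistent with cubic growth
```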


LIMIT: A Dataset to Break Models

With the theory confirmed in this abstract setting, the authors tested real models on a natural-language dataset built to maximize combinatorial complexity.

Figure 1 | LIMIT dataset creation. All combinations of relevance for N docs are mapped into simple “likes” statements.

Construction

  1. Dense combinations: Start with 46 “core” docs. Create all \(\binom{46}{2} = 1035\) queries, each relevant to a unique doc pair.
  2. Natural language mapping: Queries are “Who likes X?”, with docs as e.g., “Jon Durben likes Quokkas and Apples.”
  3. Realistic corpus: Embed 46 core docs in a larger set of 50k docs. Also make a “small” version with only the 46 relevant docs.

The task is linguistically trivial but combinatorially extreme.
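
A sketch of the underlying idea (placeholder names and attributes, not the authors' generation code): give every pair of core documents its own attribute, so that each query is relevant to exactly that pair.

```python
import itertools

# Sketch of the LIMIT construction: every pair of core documents shares a
# unique attribute, so each "Who likes X?" query has exactly two relevant docs.
NUM_DOCS = 46
people = [f"Person {i}" for i in range(NUM_DOCS)]
pairs = list(itertools.combinations(range(NUM_DOCS), 2))   # C(46, 2) = 1035

# Assign one unique attribute to each pair of documents.
pair_attribute = {pair: f"attribute_{idx}" for idx, pair in enumerate(pairs)}

# Each document lists every attribute it shares with one of its 45 partners.
doc_likes = {i: [] for i in range(NUM_DOCS)}
for (a, b), attr in pair_attribute.items():
    doc_likes[a].append(attr)
    doc_likes[b].append(attr)

docs = {i: f"{people[i]} likes {', '.join(doc_likes[i])}." for i in range(NUM_DOCS)}

# One query per pair: relevant to exactly the two documents sharing that attribute.
queries = [
    {"query": f"Who likes {attr}?", "relevant_docs": list(pair)}
    for pair, attr in pair_attribute.items()
]

print(len(queries), len(docs))   # 1035 queries, 46 core documents
```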


Results: State-of-the-Art Meets Its Match

Figure 3 | Full LIMIT results (Recall@2, Recall@10, Recall@100). Even top dense models struggle to exceed 20% Recall@100.

Findings:

  • Dimensionality matters: Higher \(d\) improves recall across all models—matching theory.
  • Alternate architectures excel: Sparse BM25 and multi-vector GTE-ModernColBERT vastly outperform dense single-vector models.

Even on the small version:

Figure 4 | LIMIT-small (N=46) results (Recall@2, Recall@10, Recall@20). No model achieves perfect recall.


Ablations: Pinpointing the Cause

Domain shift? Fine-tuning on LIMIT training data had a negligible effect; only overfitting directly on the test set solved the task. The failure is representational, not a matter of unseen query types.


Qrel pattern? Changing the relevance pattern to random, cycle, or disjoint made the task much easier. The original dense pattern remained hardest.


Correlation with BEIR? None. LIMIT measures an orthogonal capability.



What This Means for Retrieval

Key takeaways:

  1. Fundamental limit: For fixed embedding dimension, certain retrieval tasks are unsolvable for single-vector models.
  2. Combinatorial complexity kills performance: Higher sign rank equals harder tasks.
  3. LIMIT exposes fragility: Even simple-appearing tasks can break top models if relevance is densely interconnected.

Dense retrieval is still valuable—in many practical situations we don’t need all combinations. But for general-purpose, instruction-following retrieval (“any query, any relevance”), single-vector embeddings may hit an unavoidable wall.


Paths Forward

Multi-vector models: More expressive by design. Show strong LIMIT performance but underexplored for instruction-following.
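
For intuition, ColBERT-style late interaction scores a document as a sum of per-query-token maximum similarities (MaxSim) rather than a single dot product. A rough NumPy sketch, with random vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take its
    best match among the document token vectors, then sum over query tokens.

    query_vecs: (num_query_tokens, d)
    doc_vecs:   (num_doc_tokens, d)
    """
    sims = query_vecs @ doc_vecs.T     # (num_query_tokens, num_doc_tokens)
    return sims.max(axis=1).sum()      # MaxSim per query token, then sum

# Toy example with random token embeddings (illustrative only).
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))          # 5 query tokens, d = 128
d1 = rng.normal(size=(40, 128))        # document with 40 tokens
d2 = rng.normal(size=(60, 128))
print(maxsim_score(q, d1), maxsim_score(q, d2))
```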

Sparse models: Huge dimensionality avoids geometric limits, but harder to apply to reasoning tasks without lexical overlap.

Cross-encoders: Not practical for first-stage retrieval, but rerankers like Gemini-2.5-Pro can solve LIMIT-small effortlessly.


Dense retrieval isn’t dead—it’s just bounded. As we build the next generation of retrieval agents, the future may lie in hybrid approaches: maintaining dense efficiency while incorporating the expressiveness of multi-vector and sparse methods.

Understanding and respecting these theoretical limits will help us design models that fail less often in the face of combinatorial complexity—and succeed more often when it matters most.