If you have been following the explosion of Large Language Models (LLMs), you are likely familiar with Retrieval-Augmented Generation (RAG). It is the standard architecture for building AI systems that “know” things outside of their training data. The formula is generally simple: a user asks a question, a retriever hunts down relevant text chunks from a database (like Wikipedia), and an LLM synthesizes an answer based on those chunks.

However, there is a hidden variable in this formula that researchers often overlook: Granularity.

When we index our data, how big should the chunks be? Should we retrieve whole documents? Paragraphs? Sentences? For years, the industry standard has been the 100-word passage. It seemed like the “Goldilocks” zone—not too long, not too short.

But a fascinating research paper titled “Dense X Retrieval: What Retrieval Granularity Should We Use?” challenges this assumption. The researchers argue that passages are often full of irrelevant noise, and sentences lack necessary context. They propose a new unit of retrieval that changes the game: the Proposition.

In this post, we will take a deep dive into this paper to understand why the size of your retrieval unit matters, what a “proposition” actually is, and how it dramatically improves performance for open-domain Question Answering (QA).

The Problem with Passages and Sentences

To understand why we need a new retrieval unit, we first need to look at the flaws in our current methods. Most dense retrievers (like DPR or Contriever) are trained to map queries and documents into a vector space.
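To make the mechanics concrete, here is a minimal sketch of dense retrieval using the sentence-transformers library. The embedding model below (all-MiniLM-L6-v2) is just a small, convenient stand-in, not one of the retrievers evaluated in the paper:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any bi-encoder works here; all-MiniLM-L6-v2 is a stand-in, not a model from the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The Leaning Tower of Pisa is 55.86 metres tall on its low side.",
    "Prior to restoration work between 1990 and 2001, the tower leaned at 5.5 degrees.",
]
query = "How much did the Leaning Tower of Pisa lean before restoration?"

# Encode the query and every indexed chunk into the same vector space.
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Rank chunks by cosine similarity to the query and return the best match.
scores = util.cos_sim(query_vec, chunk_vecs)[0]
best = int(scores.argmax())
print(chunks[best], float(scores[best]))
```

Whatever the model, the retrieval step looks the same; the open question is what a “chunk” should be.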

The Passage (Too Noisy)

Typically, we slice documents into fixed-length passages (e.g., 100 words).

  • Pros: They usually contain enough context to understand the topic.
  • Cons: They contain a lot of “fluff.” If a 100-word passage contains the answer in a single sentence, the remaining 80–90 words may be irrelevant details about dates, other people, or side stories. This noise distracts the retrieval model (lowering the relevance score) and can confuse the LLM during answer generation.
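For concreteness, here is a minimal sketch of that fixed-length chunking, splitting on whitespace into roughly 100-word windows (real chunkers usually respect sentence boundaries and add overlap; article.txt is just a placeholder file name):

```python
def chunk_by_words(text: str, size: int = 100) -> list[str]:
    """Split text into consecutive passages of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Example: a long article becomes a list of ~100-word passages to embed and index.
passages = chunk_by_words(open("article.txt").read())  # placeholder file
```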

The Sentence (Too Vague)

Alternatively, we could index individual sentences.

  • Pros: High information density. Precise.
  • Cons: They often lack context. Consider the sentence: “He signed the bill into law on Tuesday.” If you retrieve this sentence in isolation, you have no idea who “He” is or what “the bill” refers to. Without that context, the sentence is useless for answering a question.

The “Goldilocks” Dilemma

The researchers visualize this problem perfectly in the comparison below.

Figure 1: Comparing three levels of granularity. Notice how the Passage (blue) includes irrelevant details about restoration work. The Sentence (green) relies on context from previous sentences (referring to “the tower” rather than “The Leaning Tower of Pisa”). The Proposition (red) rewrites the fact to be concise and fully self-contained.

As shown in Figure 1, the Proposition (in red) bridges the gap. It takes the semantic meaning of the sentence but rewrites it to stand alone, replacing pronouns with full entity names. It is concise and contextualized.

What Exactly is a “Proposition”?

The core contribution of this paper is defining and operationalizing the Proposition as a retrieval unit. The authors define a proposition based on three principles:

  1. Distinct Meaning: Each proposition encapsulates a distinct factoid or piece of meaning from the text.
  2. Minimal: It cannot be further split into smaller propositions. It is atomic.
  3. Contextualized & Self-Contained: It includes all necessary context (like resolving pronouns) so it can be understood entirely on its own, without reading the surrounding text.

This concept draws inspiration from linguistics and semantic evaluation, but applying it to dense retrieval indexing is a novel approach. The goal is to create an index where every single entry is a dense, clean nugget of information.
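As a made-up illustration (not an example from the paper), here is how a two-sentence passage might decompose under those three principles; note that every pronoun is resolved and each entry carries exactly one fact:

```python
passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She shared it with Pierre Curie and Henri Becquerel."
)

# Each proposition is distinct, atomic, and self-contained ("she"/"it" are resolved).
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie shared the 1903 Nobel Prize in Physics with Pierre Curie.",
    "Marie Curie shared the 1903 Nobel Prize in Physics with Henri Becquerel.",
]
```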

Methodology: Building FactoidWiki

The researchers faced a massive logistical challenge: Wikipedia is huge. You cannot manually rewrite millions of Wikipedia pages into propositions. To solve this, they automated the process using a Teacher-Student approach to create a dataset they call FactoidWiki.

The Propositionizer

They built a model called the Propositionizer to segment text automatically.

  1. The Seed (GPT-4): They first prompted GPT-4 with specific instructions to break paragraphs down into propositions, generating a high-quality seed set of roughly 42k passages paired with their propositions.
  2. The Student (Flan-T5): Using the GPT-4 data as a training set, they fine-tuned a Flan-T5-large model. This smaller, faster model could then process the entire English Wikipedia dump.

The scale of this effort is impressive. They processed 6 million Wikipedia pages, resulting in 257 million propositions, compared to 41 million passages and 114 million sentences.
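Here is a rough sketch of what running such a propositionizer could look like with Hugging Face transformers; the checkpoint name below is a placeholder, and the real model’s input/output format may differ:

```python
# pip install transformers sentencepiece torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "your-org/propositionizer-flan-t5-large"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

passage = (
    "Prior to restoration work performed between 1990 and 2001, "
    "the Leaning Tower of Pisa leaned at an angle of 5.5 degrees."
)

# Seq2seq generation: a passage goes in, a serialized list of propositions comes out.
inputs = tokenizer(passage, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```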

Figure 2: The pipeline for Dense X Retrieval. (A) The Propositionizer converts raw text into atomic facts. (B) These form the FactoidWiki index. (C) A retriever finds relevant propositions. (D) The QA model answers the user’s question.

As seen in Figure 2, the process shifts the heavy lifting to the indexing phase. Once the index is built (FactoidWiki), the retrieval process (Stages C and D) remains standard, but the quality of the data flowing into the system is significantly higher.

Experimental Setup

To prove that propositions are superior, the authors conducted a comprehensive evaluation.

  • Task: Open-Domain Question Answering.
  • Datasets: Natural Questions (NQ), TriviaQA (TQA), Web Questions (WebQ), SQuAD, and Entity Questions (EQ).
  • Retrieval Models: They tested four popular dense retrievers:
      • SimCSE and Contriever (Unsupervised).
      • DPR and GTR (Supervised/Trained on passage pairs).
  • Metrics: Recall@k (Did the retrieved text contain the answer?) and Exact Match (Did the LLM generate the correct answer?).

This setup is rigorous because it tests both unsupervised models (which haven’t learned to rely on specific chunk sizes) and supervised models (which were explicitly trained on 100-word passages).
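Both metrics are simple to compute. Below is a deliberately simplified sketch (real evaluations normalize answers more carefully, e.g. lowercasing, stripping articles and punctuation):

```python
def recall_at_k(retrieved: list[str], gold_answers: list[str], k: int) -> bool:
    """True if any of the top-k retrieved units contains a gold answer string."""
    top_k_text = " ".join(retrieved[:k]).lower()
    return any(ans.lower() in top_k_text for ans in gold_answers)

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the generated answer matches one of the gold answers."""
    return prediction.strip().lower() in {a.strip().lower() for a in gold_answers}
```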

Key Findings: Retrieval Performance

The results were surprisingly consistent: indexing by proposition outperforms indexing by passage or sentence.

This is particularly notable because models like DPR and GTR were trained to retrieve passages. Even though they were never trained on propositions, they performed better when using them.

Table 1: Retrieval performance (Recall@5 and Recall@20) across datasets and models. Look at the “Avg” column: for unsupervised retrievers like SimCSE and Contriever, propositions (bottom row of each block) offer a massive performance jump.

Analysis of the Results

  1. Unsupervised Dominance: For models like SimCSE, using propositions improved Recall@5 by 12.0 points on average compared to passages. This suggests that when a model looks for semantic similarity without bias, the proposition is the purest match.
  2. Generalization: On datasets that the supervised models had not seen during training (like SQuAD and Entity Questions), propositions provided a significant boost. This indicates that proposition retrieval generalizes better to new domains than passage retrieval.

Why Do Propositions Win? (Information Density)

One of the most compelling arguments in the paper is about Information Density. In RAG systems, we have a limited context window. We can only feed a certain number of words to the LLM.

If you retrieve 5 passages (roughly 500 words), only a handful of those sentences may actually answer the question; the rest is noise. If you retrieve 50 propositions (also roughly 500 words), you get 50 distinct, self-contained facts.
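That trade-off is easy to operationalize: instead of taking a fixed top-k, take ranked units greedily until a word budget is exhausted. A minimal sketch, assuming ranked_units is whatever your retriever returns, best first:

```python
def pack_context(ranked_units: list[str], budget_words: int = 500) -> list[str]:
    """Greedily keep the highest-ranked units until the word budget is used up."""
    context, used = [], 0
    for unit in ranked_units:
        n = len(unit.split())
        if used + n > budget_words:
            break
        context.append(unit)
        used += n
    return context

# With 100-word passages this fits about 5 chunks; with short propositions, dozens of facts.
```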

The researchers analyzed “Recall vs. Word Count” to visualize this.

Figure 3: Recall vs. number of retrieved words. The X-axis is the number of words retrieved; the Y-axis is recall. The red line (Propositions) rises faster than the blue (Passage) and green (Sentence) lines, meaning the answer is found with fewer words.

As Figure 3 shows, propositions allow the system to pack more relevant information into a tighter budget. For real-world applications where API costs and latency depend on token count, this is a massive advantage. You can achieve the same accuracy with much less text.

The “Long-Tail” Advantage

Another fascinating discovery was how propositions handle rare information. The researchers broke down performance based on entity frequency—how often the subject of the question appears in Wikipedia.

Common entities (like “Barack Obama” or “Paris”) are easy for any retriever. But “Long-Tail” entities (obscure actors, specific chemical compounds, minor historical events) are where dense retrievers usually struggle.

Figure 4: Performance on “Entity Questions,” broken down by entity frequency. The left side of the x-axis represents rare/uncommon entities. Notice the large gap between the red line (Propositions) and the blue line (Passages).

As illustrated in Figure 4, the advantage of propositions is most pronounced for rare entities. Why? Because in a passage, a rare entity might be mentioned once, surrounded by common words that dilute the vector embedding. In a proposition, the rare entity is the star of the show. The embedding focuses entirely on that specific factoid.

Case Studies: When Passages Fail

To make this concrete, let’s look at some error analysis provided in the paper. These examples show exactly why passages and sentences fail where propositions succeed.

Table 2: Error analysis. In Q1, passage retrieval surfaces “Super Bowl X” instead of “Super Bowl 50” because of keyword overlap; in Q3, sentence retrieval fails because the sentence uses the pronoun “it” instead of “pericardium.”

  • The Context Problem (Q3 in Table 2): The question asks about the function of the “pericardial sac.”
      • Sentence Retrieval finds a sentence that says: “It separates the heart from interference…” The retriever scores this low because “It” doesn’t match “pericardial sac.”
      • Proposition Retrieval finds: “The epicardium forms part of the pericardial sac that surrounds, protects, and lubricates the heart.” By explicitly naming the entity, the match is found.
  • The Distraction Problem (Q1 in Table 2): The question asks about the theme of Super Bowl 50.
      • Passage Retrieval returns a chunk about Super Bowl X (10) because it discusses themes and jerseys. The dense vector was “distracted” by the general discussion of Super Bowl themes.
      • Proposition Retrieval finds the specific fact about Super Bowl 50’s “golden anniversary” because that proposition is tightly focused on that specific game.

Conclusion and Implications

The paper “Dense X Retrieval” offers a compelling correction to how we build RAG systems. It suggests that we have been using the wrong unit of measurement for years. We treated documents like physical paper, cutting them into arbitrary lengths, when we should have been treating them as collections of facts.

Key Takeaways for Students and Practitioners:

  1. Granularity Matters: Do not just accept the default chunk size of your vector database. Experiment with smaller, cleaner units.
  2. Density is Efficiency: By using propositions, you can retrieve the same amount of knowledge with fewer tokens. This leads to faster, cheaper, and more accurate LLM inference.
  3. Cross-Task Generalization: If you can’t fine-tune your retriever for your specific domain, using proposition-level indexing is a great way to boost performance “out of the box.”

The “Propositionizer” approach does require a heavy upfront cost—processing your entire corpus to rewrite it is not cheap. However, the inference-time benefits (better accuracy, better generalization, and efficient prompting) likely outweigh the indexing costs for high-performance applications.

As we move toward more advanced AI systems, the quality of our data indexing will become just as important as the quality of our models. This paper puts a stake in the ground: the future of retrieval is atomic.