Introduction

Imagine a student preparing for a difficult mathematics exam. They don’t just memorize formulas; they work through practice problems. When they solve a problem correctly, they remember the logic they used. Later, when they face a similar but new problem, they recall that successful logic to guide them. This process—accumulating experiences, filtering out the mistakes, and recalling the most relevant and complex solutions—is fundamental to human learning.

However, Large Language Models (LLMs) typically lack this dynamic “experiential” capability in standard deployment. They are static: you prompt them, they answer, and the interaction ends. If they solve a problem brilliantly, that “thought process” usually evaporates once the session closes.

While techniques like Chain-of-Thought (CoT) prompting have revolutionized how LLMs handle reasoning by asking them to “think step by step,” there is a catch. Zero-shot prompting (asking without examples) is often unreliable for complex tasks. Few-shot prompting (providing examples) works better, but it relies heavily on humans manually crafting perfect examples.

What if an LLM could build its own “experience pool” as it answers questions, autonomously figuring out which of its past answers were good, and using those to help solve future problems?

This is the premise of RoSE (Reasoning with Orchestrated Streaming Experiences), a novel framework presented by researchers from Fudan University. RoSE allows an LLM to self-improve in a streaming setting without any human-labeled data or external feedback. In this article, we will deconstruct how RoSE turns an LLM into an experiential learner that orchestrates its own memory to become a better reasoner.

Background: The Prompting Dilemma

To appreciate RoSE, we first need to understand the current landscape of LLM reasoning.

The Power of Chain-of-Thought (CoT)

Reasoning tasks—like math word problems or common sense logic puzzles—are notoriously hard for language models because they require multi-step deduction.

  • Few-Shot CoT: This involves feeding the model a prompt containing a few questions, their reasoning steps (rationales), and the final answers. The model mimics this pattern.
  • Zero-Shot CoT: Surprisingly, simply appending “Let’s think step by step” to a question can trigger the model to generate its own reasoning path.
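
To make the contrast concrete, here is a minimal illustrative sketch of the two prompt styles; the wording and the toy problems are our own, not taken from the paper.

```python
# Illustrative CoT prompt templates; the wording and toy problems are our own.

zero_shot_cot = (
    "Q: A farmer has 12 apples and gives away 5. How many are left?\n"
    "A: Let's think step by step."
)

few_shot_cot = (
    "Q: Tom has 3 boxes with 4 pens each. How many pens does he have?\n"
    "A: Each box holds 4 pens and there are 3 boxes, so 3 * 4 = 12. The answer is 12.\n\n"
    "Q: A farmer has 12 apples and gives away 5. How many are left?\n"
    "A:"
)
```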

The Limitations

While effective, these methods have bottlenecks:

  1. Dependency on Manual Effort: Few-shot CoT requires “golden” examples. If the examples are bad, the output is bad.
  2. The “Copy Effect”: If you provide examples that are too similar to the test question, the model might lazily copy the answer or the logic pattern from the example, leading to errors.
  3. Static Nature: Standard prompts don’t evolve. The model doesn’t learn from the 100 questions it just answered.

Existing solutions like Auto-CoT try to automate example selection using clustering, but they pay little attention to the quality of the examples they pick. RoSE aims to solve this with a dynamic system that considers not just similarity, but also the uncertainty and complexity of the experiences it stores.

The RoSE Framework

The core innovation of RoSE is that it operates in a streaming setting. As questions arrive one by one, the system answers them and stores the interaction. Over time, it builds a massive library of solved problems. When a new question arrives, it acts like an orchestrator, conducting a search through its memory to find the most helpful “experiences” to use as demonstrations.

The architecture is visualized in the figure below:

Figure 1: The overview of RoSE

As shown above, the workflow is cyclical:

  1. A Test Question arrives.
  2. RoSE searches its Streaming Experience Pool (past answered questions).
  3. It performs Experience Orchestration to select the best examples based on Diversity, Uncertainty, and Complexity.
  4. It constructs a prompt using these selected experiences to answer the test question.
  5. The new Question, Reasoning Path, and Answer are added back to the pool to help with future questions.

Let’s break down the mechanics of how RoSE measures and selects these experiences.

1. The Experience Pool and Attributes

For the system to work, it cannot simply store every piece of text it generates. It needs to tag each memory with metadata that indicates its quality. RoSE attaches two critical attributes to every stored question: Uncertainty and Complexity.
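
As a rough sketch, a single pool entry could be represented like this; the field names are our own and not the paper’s notation.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One entry in the streaming experience pool (illustrative field names)."""
    question: str       # the original question
    rationale: str      # the stored reasoning path (the longest one for the majority answer)
    answer: str         # the majority-vote answer across sampled paths
    uncertainty: float  # entropy of the answer distribution; lower means more confident
    complexity: float   # average number of reasoning steps for the majority answer
```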

Calculating Uncertainty

How does the model know if its own past answer was likely correct without a human checking it? The researchers use the concept of Self-Consistency.

When RoSE processes a question, it doesn’t just generate one answer. It generates multiple different reasoning paths (e.g., 20 different attempts). It then looks at the final answers produced by these paths.

  • If 19 out of 20 paths lead to the answer “42,” the model is confident (Low Uncertainty).
  • If the answers are scattered (some say “42”, some “12”, some “100”), the model is confused (High Uncertainty).

Mathematically, this is calculated using entropy. First, they identify the unique answers and their probabilities:

\[
p(a_i^*) = \frac{\bigl|\{\, j : a_j = a_i^* \,\}\bigr|}{n}, \qquad u_{q_t} = -\sum_{i} p(a_i^*) \log p(a_i^*)
\]

where \(n\) is the number of sampled reasoning paths and \(a_j\) is the final answer of the \(j\)-th path.

Here, \(p(a_i^*)\) represents the probability of a specific answer occurring among the generated paths. The uncertainty \(u_{q_t}\) is the entropy of this distribution. High entropy means high uncertainty.
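
A minimal sketch of this computation, assuming the final answers have already been parsed out of the sampled reasoning paths (the function name is ours):

```python
import math
from collections import Counter

def answer_uncertainty(answers: list[str]) -> float:
    """Entropy of the empirical answer distribution over sampled reasoning paths."""
    counts = Counter(answers)
    n = len(answers)
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# 19 of 20 paths agree -> low uncertainty; scattered answers -> high uncertainty.
print(answer_uncertainty(["42"] * 19 + ["41"]))            # ~0.20
print(answer_uncertainty(["42", "12", "100", "7", "42"]))  # ~1.33
```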

Why does this matter? The researchers found a strong correlation between uncertainty and accuracy. As shown in the graph below, as uncertainty increases (moving right on the x-axis), the accuracy (dashed line) plummets.

Figure 2: The relation between accuracy and the magnitude of uncertainty value on SVAMP dataset.

By filtering out experiences with high uncertainty, RoSE avoids using “hallucinated” or incorrect answers as examples for future problems.

Calculating Complexity

Not all correct answers are created equal. A simple problem like “1 + 1 = 2” provides very little instructional value to a model trying to solve advanced calculus. The researchers posit that complex questions—those requiring more steps to solve—make for better teachers.

RoSE measures complexity based on the length of the reasoning path. The intuition is that a longer chain of thought contains more detailed logic.

\[
c_q = \frac{1}{|R^*_q|} \sum_{r \in R^*_q} |r|
\]

where \(R^*_q\) is the set of sampled reasoning paths whose final answer equals the most frequent answer for question \(q\), and \(|r|\) is the number of steps in path \(r\).

In this equation, \(c_q\) (complexity) is the average number of steps in the reasoning paths associated with the most frequent answer. When adding a question to the pool, RoSE saves the specific reasoning path that has the most steps, ensuring the stored experience is as detailed as possible:

\[
r_q = \arg\max_{r \in R^*_q} |r|
\]
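
Under the assumption that each sampled path is a list of step strings, both quantities can be computed together in a short sketch like this (the helper name is ours):

```python
from collections import Counter

def summarize_paths(paths: list[list[str]], answers: list[str]) -> tuple[str, float, list[str]]:
    """Return (majority answer, complexity, stored rationale) for one question.

    Complexity is the average step count over the paths that reach the majority
    answer; the stored rationale is the longest of those paths.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    winning = [p for p, a in zip(paths, answers) if a == majority]
    complexity = sum(len(p) for p in winning) / len(winning)
    rationale = max(winning, key=len)
    return majority, complexity, rationale
```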

This results in an experience pool where every entry looks something like this:

Table 1: An example of the experiences stored in the experience pool.

2. Experience Orchestration

Now that we have a pool of questions tagged with Uncertainty (\(u\)) and Complexity (\(c\)), how does RoSE select the best examples to help answer a new test question (\(q_t\))?

Random selection is risky. Selecting only the most similar questions risks the “copy effect.” RoSE uses a three-stage funnel: Diversity \(\rightarrow\) Uncertainty Filtering \(\rightarrow\) Complexity Selection.

Step A: Diversity via Bucketing

First, RoSE calculates the semantic similarity between the new test question and every question in the pool. It sorts the pool from lowest similarity to highest similarity.

Instead of picking the top \(k\) most similar questions, RoSE splits the sorted questions into \(k\) uniform “buckets” and ultimately draws one demonstration from each bucket.

  • Why? This ensures the examples cover a range of relationships—some very similar to the current problem, and some more distinct. This distribution prevents the model from overfitting to a specific sentence structure and encourages broader reasoning generalization.
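
A rough sketch of the bucketing step, assuming a similarity score between the test question and each pooled question has already been computed (how that similarity is obtained, e.g. with sentence embeddings, is abstracted away here):

```python
def make_buckets(pool: list, similarities: list[float], k: int) -> list[list]:
    """Sort pool entries from least to most similar and split them into k buckets."""
    ranked = [e for _, e in sorted(zip(similarities, pool), key=lambda pair: pair[0])]
    size = max(1, len(ranked) // k)
    buckets = [ranked[i * size:(i + 1) * size] for i in range(k - 1)]
    buckets.append(ranked[(k - 1) * size:])  # the last bucket absorbs any remainder
    return buckets
```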

Step B: Uncertainty-Based Filtering

Inside each bucket there are many candidate questions, and some of their stored answers may be incorrect. RoSE needs to filter these out.

However, a fixed threshold (e.g., “discard any uncertainty > 0.5”) is dangerous because uncertainty varies by task and by how full the pool is. RoSE uses a Dynamic Threshold. It looks at the minimum uncertainty found within that specific bucket and sets a threshold relative to that minimum (e.g., 1.2 times the minimum).

\[
B_i' = \left\{\, q \in B_i \;:\; u_q \le \alpha \cdot \min_{q' \in B_i} u_{q'} \,\right\}
\]

where \(B_i\) is the \(i\)-th bucket and \(\alpha\) is the scaling factor (e.g., \(\alpha = 1.2\)).

This equation ensures that for every bucket, we only keep the “safest” and most confident answers available at that time, relative to their peers.
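
A minimal sketch of this filter, reusing the Experience fields sketched earlier and the 1.2 scaling factor mentioned above:

```python
def filter_by_uncertainty(bucket: list, scale: float = 1.2) -> list:
    """Keep only experiences whose uncertainty is close to the bucket's minimum."""
    threshold = scale * min(e.uncertainty for e in bucket)
    return [e for e in bucket if e.uncertainty <= threshold]
```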

Step C: Complexity-Based Selection

Finally, from the questions that survive the filter in each bucket, RoSE picks the “winner” based on complexity: the question with the highest complexity score.

\[
q_i^* = \arg\max_{q \in B_i'} c_q
\]

The logic is elegant: Among a diverse set of confident answers, choose the ones that required the most thought to solve.
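
Tying the three stages together, the orchestration might look like the following sketch; this is our reading of the procedure built from the earlier helpers, not the authors’ code.

```python
def orchestrate(pool: list, similarities: list[float], k: int) -> list:
    """Select k demonstrations: diverse buckets -> confident survivors -> most complex."""
    demos = []
    for bucket in make_buckets(pool, similarities, k):
        if not bucket:
            continue
        confident = filter_by_uncertainty(bucket)
        demos.append(max(confident, key=lambda e: e.complexity))
    return demos
```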

3. The Inference Step

Once the \(k\) best experiences (question, rationale, answer triplets) are selected, they are formatted into a prompt along with the new test question. The LLM then generates the final output:

\[
(r_t, a_t) = \mathrm{LLM}\bigl([\,q_1^*, r_1^*, a_1^*;\; \ldots;\; q_k^*, r_k^*, a_k^*;\; q_t\,]\bigr)
\]

where each \((q_i^*, r_i^*, a_i^*)\) is a selected experience triplet and \([\cdot]\) denotes concatenation into a single prompt.

Crucially, once this output is generated, the test question \(q_t\), its rationale \(r_t\), and its answer \(a_t\) are scored for uncertainty and complexity and added back into the pool. The system gets smarter with every query.
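
One full streaming step could then be sketched as follows; `sample_paths` stands in for whatever routine samples multiple LLM completions and parses them into step lists and final answers, and the prompt format is our own.

```python
from typing import Callable

def answer_one(
    question: str,
    pool: list,
    similarities: list[float],
    sample_paths: Callable[[str], tuple[list[list[str]], list[str]]],
    k: int = 4,
) -> str:
    """Answer one streaming question with selected demonstrations, then grow the pool."""
    demos = orchestrate(pool, similarities, k)
    prompt = "".join(
        f"Q: {d.question}\nA: {d.rationale} The answer is {d.answer}.\n\n" for d in demos
    ) + f"Q: {question}\nA: Let's think step by step."

    # Sample several reasoning paths for the new question and score them.
    paths, answers = sample_paths(prompt)
    majority, complexity, rationale = summarize_paths(paths, answers)
    uncertainty = answer_uncertainty(answers)

    # The new experience joins the pool so it can serve future questions.
    pool.append(Experience(question, " ".join(rationale), majority, uncertainty, complexity))
    return majority
```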

Experiments and Results

The researchers evaluated RoSE on 9 different reasoning tasks, covering arithmetic (like GSM8K and SVAMP) and common sense (like StrategyQA). They compared it against standard Zero-Shot CoT, Few-Shot CoT (with human examples), and Auto-CoT.

The main results are striking:

Table 2: Main results for RoSE.

Key Takeaways from the Data:

  1. RoSE vs. Zero-Shot: On the GPT-3.5-Turbo model, RoSE improves over Zero-Shot CoT by an average of roughly 8.4 points. This confirms that the self-generated experience pool provides massive value over having no examples.
  2. RoSE vs. Manual Few-Shot: RoSE even outperforms standard Few-Shot CoT (which uses human-crafted examples) by about 5.9 points on average. This is a significant finding: automated, dynamic selection of past experiences can beat static, human-curated examples.
  3. Model Versatility: The gains are present not just on GPT-3.5, but also on the open-source LLaMA2-13B model, where RoSE improved the average score from 24.2% (Zero-Shot) to 65.7%.

Why Does It Work? (Ablation Analysis)

Is the complex orchestration really necessary? Could we just use one part of it? The researchers broke down the contribution of each component:

Figure 3: The impact of each orchestration process.

In this chart:

  • Diversity (Green): Just ensuring diverse examples (similar to Auto-CoT) provides a baseline boost.
  • Confidence/Uncertainty (Orange): Adding the uncertainty filter significantly jumps the performance. This confirms that filtering out “bad memories” is crucial.
  • Complexity/RoSE (Blue): The full RoSE model, which prioritizes complexity, yields the highest accuracy across almost all tasks.

The Value of Complexity

To further prove that “harder” examples are better, the researchers ran a comparison selecting Simple vs. Middle vs. Hard (Complex) examples.

Figure 4: The impact of complexity.

As visible in Figure 4, the “Hard” (complex) examples (represented by the light beige bars) consistently yield higher accuracy than the Simple ones. This validates the theory that exposing the model to more detailed reasoning steps helps it structure its own thinking better.

Stability and Robustness

One of the weaknesses of standard Few-Shot prompting is that performance can vary wildly depending on how many examples you provide.

Figure 5: Results on different demonstration quantities.

Figure 5 shows the accuracy (y-axis) against the number of demonstrations (x-axis).

  • Brown Line (Few-Shot-CoT): Notice how it fluctuates. Adding more examples sometimes hurts performance (e.g., in the AddSub task on the left).
  • Diamond Line (RoSE): It remains highly stable and consistently superior, regardless of whether 2, 4, or 8 demonstrations are used. This stability makes RoSE a much more reliable framework for real-world applications.

Test Order and Versatility

Finally, because RoSE is a streaming system, the order in which questions arrive changes the content of the memory pool. The researchers tested different random orders (Figure 6) and found that while there is slight fluctuation, the performance distribution remains consistently higher than baselines.

Figure 6: Results on different test orders.

They also tested RoSE on top of other prompting strategies like “Plan-and-Solve” and “Tree of Thoughts” (ToT). In all cases, adding the RoSE framework improved performance, proving it is a general-purpose enhancer.

Table 6: Comparison of various CoT methods

Conclusion and Implications

The RoSE framework represents a significant step toward autonomous large language models. By closing the loop—allowing a model to store its outputs, evaluate its own confidence, and strategically recall its best work—we move away from static text generators toward systems that learn from experience.

The key innovations of RoSE are:

  1. Orchestrated Memory: Treating past inputs not just as data, but as a queryable library of experiences.
  2. Self-Correction without Feedback: Utilizing uncertainty (consistency) to filter out errors without needing a human in the loop.
  3. Complexity Prioritization: Recognizing that deep reasoning requires exposure to deep reasoning examples.

For students and researchers in AI, RoSE demonstrates that we haven’t yet hit the ceiling of what current LLMs can do. Before building larger models, we can make the existing ones significantly smarter simply by changing how they organize and access their own “thoughts.”