[Towards General Agentic Intelligence via Environment Scaling 🔗](https://arxiv.org/abs/2509.13311)

AgentScaler: How Scaling Environments, Not Just Models, Unlocks Advanced AI Agents

Imagine asking your AI assistant to plan a weekend trip to a new city. You want it to book flights that avoid layovers, find a pet-friendly hotel near the city center, reserve a table at a highly-rated vegan restaurant, and buy tickets for a museum exhibit. This isn’t a simple question-and-answer task; it’s a complex, multi-step process that requires interacting with multiple external services: an airline API, a hotel booking system, a restaurant reservation platform, and a ticket vendor. ...

2025-09
[IS IN-CONTEXT LEARNING LEARNING? 🔗](https://arxiv.org/abs/2509.10414)

Beyond the Hype: Do LLMs Actually Learn, or Just Memorize? A Deep Dive into In-Context Learning

Large Language Models (LLMs) like GPT-4 have shown a remarkable capability: they can often perform new tasks immediately after seeing only a handful of examples. Whether it’s translating sentences, classifying customer sentiment, or solving logic puzzles, you can provide a few demonstrations and the model will produce a response for a new, unseen input. This phenomenon is known as In-Context Learning (ICL)—and it’s part of what makes these models feel so versatile. ...
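Mechanically, the "handful of examples" is just prompt assembly: the demonstrations are concatenated ahead of the new input, and the model is asked to complete the final label. A minimal sketch of that construction (the `Input:`/`Label:` formatting and the sentiment examples are illustrative assumptions, not taken from the paper):

```python
def build_few_shot_prompt(demos, query):
    """Assemble an in-context learning prompt: k demonstrations, then the new input."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    # The prompt ends at "Label:" so the model's continuation is the prediction.
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

demos = [
    ("The food was delicious!", "positive"),
    ("Terrible service, never again.", "negative"),
]
prompt = build_few_shot_prompt(demos, "A pleasant surprise from start to finish.")
print(prompt)
```

No weights change during this process, which is exactly why the paper's question — whether this counts as *learning* — is worth asking.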

2025-09
[DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL 🔗](https://arxiv.org/abs/2509.10446)

Beyond Google: How DeepDive Teaches LLMs to Be Expert Researchers

We’ve all been there—you’re chasing down the answer to a fiendishly specific question, and a quick Google search just won’t cut it. You end up opening dozens of tabs, cross-referencing facts, and piecing together clues from scattered sources. This kind of deep search is a uniquely human skill, demanding patience, critical thinking, and the ability to connect seemingly unrelated information. For Large Language Models (LLMs), deep search is still the final frontier. They excel when answers are baked into their parameters but stumble on complex, real-world problems requiring multi-step investigation with browsing tools. The gap is especially stark between cutting-edge proprietary models and their open-source counterparts. ...

2025-09
[K2-Think: A Parameter-Efficient Reasoning System 🔗](https://arxiv.org/abs/2509.07604)

K2-THINK: How a 32B Model Punches Above Its Weight to Rival AI Giants

Figure: The official K2-THINK logo from the Institute of Foundation Models at MBZUAI.

In the world of artificial intelligence, there’s a common belief: bigger is better. Large Language Models (LLMs) have ballooned to hundreds of billions, or even trillions, of parameters. These colossal systems have achieved astounding feats—but they come with trade-offs: they are expensive to train, hard to deploy, and often inaccessible for most researchers. ...

2025-09
[Discovery of Unstable Singularities 🔗](https://arxiv.org/abs/2509.14185)

Balancing on a Razor's Edge: How AI is Discovering Elusive Singularities in Fluid Dynamics

The Unpredictable Dance of Fluids and the Quest for Singularities

Imagine pouring cream into your coffee. The intricate swirls and eddies that form are a beautiful, everyday example of fluid dynamics. For centuries, mathematicians and physicists have used a set of equations—some dating back to Leonhard Euler in the 1750s—to describe this motion. These equations, like the Euler and Navier–Stokes equations, are the bedrock of our understanding of everything from weather patterns to the airflow over an airplane’s wing. ...

2025-09
[LIVEMCP-101: STRESS TESTING AND DIAGNOSING MCP-ENABLED AGENTS ON CHALLENGING QUERIES 🔗](https://arxiv.org/abs/2508.15760)

Putting AI Agents to the Test: Inside LiveMCP-101's Gauntlet of Real-World Challenges

Introduction: The Quest for Reliable AI Agents

The science fiction dream of an AI assistant—think Iron Man’s J.A.R.V.I.S.—that can understand complex instructions, search the web, manage files, and execute multi-step plans flawlessly feels increasingly close to reality. These systems, known as AI agents, represent the next frontier in artificial intelligence. By using external “tools”—such as a web search API, a spreadsheet editor, or a booking service—agents can break free from the limits of pre-trained knowledge and operate dynamically in the real world. ...

2025-08
[The Majority is not always right: RL training for solution aggregation 🔗](https://arxiv.org/abs/2509.06870)

Beyond Majority Rule: Training LLMs to Synthesize the Best Answer from Many Guesses

When faced with a tough problem, what do you do? You might brainstorm a few different approaches, weigh their pros and cons, and then combine the best parts of each to forge a final, solid solution. It turns out we can teach Large Language Models (LLMs) to do something very similar — and it dramatically improves their ability to solve complex reasoning tasks. For years, a standard strategy for boosting LLM performance on hard problems like math or coding has been to increase the “test-time compute.” Instead of asking the model for just one answer, we ask it for many. Then we pick the most common answer — a technique called self-consistency or majority voting. It’s simple, often effective, and feels intuitive: if ten different lines of reasoning all point to the answer “42,” then “42” is probably correct. ...
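The baseline the paper improves on, self-consistency, is easy to state concretely: sample many independent reasoning chains, extract each chain's final answer, and return the most frequent one. A minimal sketch (the sampled answers below are made up for illustration):

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency baseline: return the most common final answer
    among independently sampled solutions."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Final answers extracted from ten sampled reasoning chains:
samples = ["42", "42", "41", "42", "17", "42", "42", "41", "42", "42"]
print(majority_vote(samples))  # "42" wins with 7 of 10 votes
```

The paper's point is the failure mode this sketch makes visible: if the correct answer appears in only one or two of the ten chains, majority voting discards it by construction, whereas a trained aggregator can recognize and keep it.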

2025-09
[Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate 🔗](https://arxiv.org/abs/2509.05396)

When More AI Brains Are Worse Than One: The Hidden Dangers of AI Debate

It’s a principle we learn early on: two heads are better than one. Collaboration, discussion, and debate are hallmarks of human problem-solving. By challenging each other’s assumptions and sharing different perspectives, we often arrive at better answers than any single person could produce alone. It seems natural to assume the same would hold true for artificial intelligence. In recent years, a wave of research has explored the idea of multi-agent debate, where multiple Large Language Models (LLMs) work together to solve complex problems. The premise is intuitive: if one AI makes a mistake, another can catch it. By exchanging reasoning, they can refine their arguments, reduce individual biases, and ultimately boost their collective decision-making. This approach has shown promise in everything from mathematical reasoning to generating more truthful answers. ...

2025-09
[ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute 🔗](https://arxiv.org/abs/2509.04475)

Breaking the 'Tunnel Vision' of LLMs: An In-depth Look at ParaThinker's Parallel Reasoning

Introduction: Thinking Longer vs. Thinking Wider

In the relentless quest to make Large Language Models (LLMs) smarter, one strategy has dominated recent breakthroughs: scaling test-time compute. The idea is simple yet powerful—give a model more time and computational resources to “think” before producing an answer. By generating longer, more detailed chains of thought, models such as OpenAI’s o1 have demonstrated remarkable improvements in complex reasoning tasks. But this “think longer” approach is hitting a wall. Beyond a certain point, increasing a model’s computation budget yields diminishing returns. Accuracy stagnates, and the model may start “overthinking,” where additional reasoning steps don’t help—and can even hurt—performance. This raises a pivotal question: ...

2025-09
[AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning 🔗](https://arxiv.org/abs/2509.08755)

Learning by Doing: How AgentGym-RL Teaches LLMs to Solve Real-World Problems

Large Language Models (LLMs) are rapidly evolving from impressive text generators into autonomous agents capable of tackling complex, real-world tasks. Imagine an AI that can not only answer your questions but also navigate websites to book a flight, conduct multi-step scientific research, or even play a digital game. This is the frontier of AI research: creating agents that can reason, plan, and act over long horizons. But how do we teach an LLM to do this? Just like humans, the most effective way for an agent to learn is through practice—by interacting with an environment, trying things, making mistakes, and learning from the outcomes. This is the core idea behind Reinforcement Learning (RL). ...
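The practice loop described above (act, observe the outcome, collect reward, repeat) is the standard agent-environment interface at the heart of RL. A minimal sketch with a hypothetical toy environment and policy; none of these names or functions come from AgentGym-RL itself:

```python
def run_episode(env_step, policy, max_turns=5):
    """Generic multi-turn agent-environment loop: the agent picks an action
    from its current observation, the environment responds with a new
    observation, a reward, and a done flag."""
    observation, total_reward, trajectory = "start", 0.0, []
    for _ in range(max_turns):
        action = policy(observation)
        observation, reward, done = env_step(action)
        trajectory.append((action, reward))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward

def toy_env_step(action):
    """Hypothetical one-goal environment: reward arrives only on 'submit'."""
    if action == "submit":
        return "done", 1.0, True
    return "searching", 0.0, False

def toy_policy(observation):
    """Hypothetical hand-written policy: search first, then submit."""
    return "submit" if observation == "searching" else "search"

trajectory, total_reward = run_episode(toy_env_step, toy_policy)
print(trajectory, total_reward)  # two turns: a search, then a rewarded submit
```

In actual RL training the hand-written policy is replaced by the LLM, and the trajectories and rewards are used to update its weights; the loop structure stays the same.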

2025-09