[Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search 🔗](https://arxiv.org/abs/2405.15383)

Teaching LLMs to Code the World: A New Path for Smarter AI Agents

Imagine teaching a robot to play chess. You could show it millions of games and hope it learns the patterns, as many deep learning models do. Or, you could give it the rulebook. Armed with the rules, the robot wouldn’t just mimic past games — it could reason about any possible board position, predict outcomes, and plan its moves strategically. This rulebook is what AI researchers call a world model — an internal simulation of how the world works. ...

2024-05 · 8 min · 1509 words
[ATOKEN: A UNIFIED TOKENIZER FOR VISION 🔗](https://arxiv.org/abs/2509.14476)

One Tokenizer to Rule Them All? A Deep Dive into ATOKEN for Images, Videos, and 3D

Introduction: The Quest for a Universal Language of Vision In the world of AI, Large Language Models (LLMs) like GPT-4 have become masters of generalization. A single model can write code, translate languages, and reason about complex topics. A key ingredient in this success is the humble tokenizer—a component that breaks down all forms of text (code, prose, tables) into a shared, unified set of tokens. This “universal language” allows models to scale efficiently and transfer knowledge seamlessly across tasks. ...

2025-09 · 6 min · 1272 words
[ARE: scaling up agent environments and evaluations 🔗](https://arxiv.org/abs/2509.17158)

Beyond the ReAct Loop: Building and Testing Smarter AI Agents with ARE and Gaia2

AI agents are getting impressively good. They can search the web, book flights, and manage your calendar. But if you’ve ever used one, you know they still feel a bit… fragile. They operate in a world that conveniently pauses while they think — a luxury none of us have. The real world is messy, dynamic, and asynchronous—things happen whether our agent is ready or not. This gap between sterile lab environments and the chaotic real world is one of the biggest hurdles holding back truly useful AI assistants. ...

2025-09 · 7 min · 1477 words
[Aggregated Residual Transformations for Deep Neural Networks 🔗](https://arxiv.org/abs/1611.05431)

ResNeXt: Adding a New Dimension to Deep Neural Network Design

In deep learning, building more powerful neural networks has traditionally followed two paths: making them deeper or making them wider. The VGG architecture demonstrated the impact of depth, stacking many simple, repeated layers to great effect. ResNet introduced residual connections, enabling extremely deep networks to be trained without falling prey to the dreaded vanishing gradients. Meanwhile, Google’s Inception family charted a different course toward width, creating multi-branch modules with carefully designed parallel paths, each with specialized convolution filters. ...

2016-11 · 6 min · 1189 words
[Self-Forcing++: Towards Minute-Scale High-Quality Video Generation 🔗](https://arxiv.org/abs/2510.02283)

From Seconds to Minutes: How Self-Forcing++ Teaches AI to Generate Long Videos

The world of AI video generation is evolving at lightning speed. Models like OpenAI’s Sora, Google’s Veo, and others are producing clips with breathtaking realism, often blurring the line between synthetic and real content. Yet, for all their power, most of these state-of-the-art systems share a frustrating limitation: they can only create short videos—typically capped at 5 to 10 seconds. Why is that? The very architecture that makes them so powerful—the Diffusion Transformer (DiT)—is also their Achilles’ heel. Generating a video all at once is computationally monumental, and the attention cost grows quadratically with video length. It’s akin to trying to write an entire novel in one thought: theoretically possible, but wildly impractical. ...

2025-10 · 6 min · 1224 words
[STOCKBENCH: CAN LLM AGENTS TRADE STOCKS PROFITABLY IN REAL-WORLD MARKETS? 🔗](https://arxiv.org/abs/2510.02209)

Can AI Beat Wall Street? Testing LLM Agents in the Stock Market with STOCKBENCH

Large language models (LLMs) have evolved far beyond clever chatbots — they’re now powerful autonomous agents capable of reasoning, planning, and executing complex tasks. They can write code, assist in scientific discovery, and automate entire workflows in marketing or engineering. This rapid progress raises an exciting question: can these AI agents conquer one of the most challenging, high-stakes arenas in the world — the stock market? The potential is enormous. An AI capable of analyzing market data, interpreting news, and making profitable trades could transform finance. But testing whether an LLM has what it takes is not straightforward. ...

2025-10 · 6 min · 1172 words
[EXGRPO: LEARNING TO REASON FROM EXPERIENCE 🔗](https://arxiv.org/abs/2510.02245)

Don't Waste Your Mistakes: How Smart Experience Replay Unlocks Reasoning in LLMs

Large Language Models (LLMs) are getting remarkably good at complex reasoning tasks, from solving math competition problems to writing code. A key technique driving this progress is Reinforcement Learning (RL), specifically a paradigm called Reinforcement Learning from Verifiable Rewards (RLVR). In RLVR, we treat an LLM’s reasoning process—its chain of thought—as a sequence of actions. If the final answer is correct, the model gets a reward. It’s a simple yet powerful way to teach models to “think” better. ...

2025-10 · 6 min · 1272 words
[Towards General Agentic Intelligence via Environment Scaling 🔗](https://arxiv.org/abs/2509.13311)

AgentScaler: How Scaling Environments, Not Just Models, Unlocks Advanced AI Agents

Figure: A cheerful explorer monkey surrounded by symbols of science and learning — representing the curiosity and versatility of agentic AI. Imagine asking your AI assistant to plan a weekend trip to a new city. You want it to book flights that avoid layovers, find a pet-friendly hotel near the city center, reserve a table at a highly-rated vegan restaurant, and buy tickets for a museum exhibit. This isn’t a simple question-and-answer task; it’s a complex, multi-step process that requires interacting with multiple external services: an airline API, a hotel booking system, a restaurant reservation platform, and a ticket vendor. ...

2025-09 · 6 min · 1167 words
[IS IN-CONTEXT LEARNING LEARNING? 🔗](https://arxiv.org/abs/2509.10414)

Beyond the Hype: Do LLMs Actually Learn, or Just Memorize? A Deep Dive into In-Context Learning

Large Language Models (LLMs) like GPT-4 have shown a remarkable capability: they can often perform new tasks immediately after seeing only a handful of examples. Whether it’s translating sentences, classifying customer sentiment, or solving logic puzzles, you can provide a few demonstrations and the model will produce a response for a new, unseen input. This phenomenon is known as In-Context Learning (ICL)—and it’s part of what makes these models feel so versatile. ...

2025-09 · 6 min · 1139 words
[TOWARDS A PHYSICS FOUNDATION MODEL 🔗](https://arxiv.org/abs/2509.13805)

GPhyT: The Dawn of a Universal Physics Engine?

Introduction: From Language to the Laws of Nature In recent years, a new paradigm has reshaped the landscape of AI: the foundation model. Systems like GPT-4 have shown how a single, massive model can be trained once and then adapted to countless tasks—writing poetry, generating code, answering questions—without retraining. This “train once, deploy anywhere” philosophy has revolutionized natural language processing. Now imagine applying this concept to the physical world. What if one pre-trained model could simulate anything—whether it’s the turbulent airflow over a wing, the shockwaves from a supersonic jet, or the slow seepage of fluids through porous rock? A Physics Foundation Model (PFM) could democratize access to high-fidelity simulations, accelerate scientific discovery, and eliminate the years of specialized numerical-solver development currently required for each new problem. ...

2025-09 · 6 min · 1186 words
[DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL 🔗](https://arxiv.org/abs/2509.10446)

Beyond Google: How DeepDive Teaches LLMs to Be Expert Researchers

We’ve all been there—you’re chasing down the answer to a fiendishly specific question, and a quick Google search just won’t cut it. You end up opening dozens of tabs, cross-referencing facts, and piecing together clues from scattered sources. This kind of deep search is a uniquely human skill, demanding patience, critical thinking, and the ability to connect seemingly unrelated information. For Large Language Models (LLMs), deep search is still the final frontier. They excel when answers are baked into their parameters but stumble on complex, real-world problems requiring multi-step investigation with browsing tools. The gap is especially stark between cutting-edge proprietary models and their open-source counterparts. ...

2025-09 · 6 min · 1267 words
[K2-Think: A Parameter-Efficient Reasoning System 🔗](https://arxiv.org/abs/2509.07604)

K2-THINK: How a 32B Model Punches Above Its Weight to Rival AI Giants

Figure: The official K2-THINK logo from the Institute of Foundation Models at MBZUAI. In the world of artificial intelligence, there’s a common belief: bigger is better. Large Language Models (LLMs) have ballooned to hundreds of billions, or even trillions, of parameters. These colossal systems have achieved astounding feats—but they come with trade-offs: they are expensive to train, hard to deploy, and often inaccessible for most researchers. But what if a smaller, more agile model could challenge these giants? What if clever engineering could matter more than brute-force scale? ...

2025-09 · 6 min · 1129 words
[Discovery of Unstable Singularities 🔗](https://arxiv.org/abs/2509.14185)

Balancing on a Razor's Edge: How AI is Discovering Elusive Singularities in Fluid Dynamics

The Unpredictable Dance of Fluids and the Quest for Singularities Imagine pouring cream into your coffee. The intricate swirls and eddies that form are a beautiful, everyday example of fluid dynamics. For centuries, mathematicians and physicists have used a set of equations—some dating back to Leonhard Euler in the 1750s—to describe this motion. These equations, like the Euler and Navier–Stokes equations, are the bedrock of our understanding of everything from weather patterns to the airflow over an airplane’s wing. ...

2025-09 · 8 min · 1578 words
[LIVEMCP-101: STRESS TESTING AND DIAGNOSING MCP-ENABLED AGENTS ON CHALLENGING QUERIES 🔗](https://arxiv.org/abs/2508.15760)

Putting AI Agents to the Test: Inside LiveMCP-101's Gauntlet of Real-World Challenges

Introduction: The Quest for Reliable AI Agents The science fiction dream of an AI assistant—think Iron Man’s J.A.R.V.I.S.—that can understand complex instructions, search the web, manage files, and execute multi-step plans flawlessly feels increasingly close to reality. These systems, known as AI agents, represent the next frontier in artificial intelligence. By using external “tools”—such as a web search API, a spreadsheet editor, or a booking service—agents can break free from the limits of pre-trained knowledge and operate dynamically in the real world. ...

2025-08 · 6 min · 1276 words
[The Majority is not always right: RL training for solution aggregation 🔗](https://arxiv.org/abs/2509.06870)

Beyond Majority Rule: Training LLMs to Synthesize the Best Answer from Many Guesses

When faced with a tough problem, what do you do? You might brainstorm a few different approaches, weigh their pros and cons, and then combine the best parts of each to forge a final, solid solution. It turns out we can teach Large Language Models (LLMs) to do something very similar — and it dramatically improves their ability to solve complex reasoning tasks. For years, a standard strategy for boosting LLM performance on hard problems like math or coding has been to increase the “test-time compute.” Instead of asking the model for just one answer, we ask it for many. Then we pick the most common answer — a technique called self-consistency or majority voting. It’s simple, often effective, and feels intuitive: if ten different lines of reasoning all point to the answer “42,” then “42” is probably correct. ...

2025-09 · 6 min · 1251 words
[Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate 🔗](https://arxiv.org/abs/2509.05396)

When More AI Brains Are Worse Than One: The Hidden Dangers of AI Debate

It’s a principle we learn early on: two heads are better than one. Collaboration, discussion, and debate are hallmarks of human problem-solving. By challenging each other’s assumptions and sharing different perspectives, we often arrive at better answers than any single person could produce alone. It seems natural to assume the same would hold true for artificial intelligence. In recent years, a wave of research has explored the idea of multi-agent debate, where multiple Large Language Models (LLMs) work together to solve complex problems. The premise is intuitive: if one AI makes a mistake, another can catch it. By exchanging reasoning, they can refine their arguments, reduce individual biases, and ultimately boost their collective decision-making. This approach has shown promise in everything from mathematical reasoning to generating more truthful answers. ...

2025-09 · 6 min · 1246 words
[ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute 🔗](https://arxiv.org/abs/2509.04475)

Breaking the 'Tunnel Vision' of LLMs: An In-depth Look at ParaThinker's Parallel Reasoning

Introduction: Thinking Longer vs. Thinking Wider In the relentless quest to make Large Language Models (LLMs) smarter, one strategy has dominated recent breakthroughs: scaling test-time compute. The idea is simple yet powerful—give a model more time and computational resources to “think” before producing an answer. By generating longer, more detailed chains of thought, models such as OpenAI’s o1 have demonstrated remarkable improvements in complex reasoning tasks. But this “think longer” approach is hitting a wall. Beyond a certain point, increasing a model’s computation budget yields diminishing returns. Accuracy stagnates, and the model may start “overthinking,” where additional reasoning steps don’t help—and can even hurt—performance. This raises a pivotal question: ...

2025-09 · 7 min · 1367 words
[AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning 🔗](https://arxiv.org/abs/2509.08755)

Learning by Doing: How AgentGym-RL Teaches LLMs to Solve Real-World Problems

Large Language Models (LLMs) are rapidly evolving from impressive text generators into autonomous agents capable of tackling complex, real-world tasks. Imagine an AI that can not only answer your questions but also navigate websites to book a flight, conduct multi-step scientific research, or even play a digital game. This is the frontier of AI research: creating agents that can reason, plan, and act over long horizons. But how do we teach an LLM to do this? Just like humans, the most effective way for an agent to learn is through practice—by interacting with an environment, trying things, making mistakes, and learning from the outcomes. This is the core idea behind Reinforcement Learning (RL). ...

2025-09 · 7 min · 1333 words
[ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning 🔗](https://arxiv.org/abs/2509.04903)

Beyond 'Good Enough': How ACE-RL Teaches LLMs to Master Long-Form Writing

Large Language Models (LLMs) have become incredibly adept at understanding vast amounts of text. Give them a 100-page document, and they can summarize it, answer questions about it, and find needles in the haystack. But when you flip the script and ask them to generate a long, high-quality document—like a detailed report, a compelling story, or a legal brief—they often stumble. The output might be coherent at a sentence level, yet can quickly lose focus, become repetitive, or fail to meet the specific, nuanced requirements of the prompt. ...

2025-09 · 7 min · 1340 words
[REFRAG: Rethinking RAG based Decoding 🔗](https://arxiv.org/abs/2509.01092)

REFRAG: Supercharging RAG with 30× Faster First-Token Generation

Large Language Models (LLMs) have transformed how we interact with information, but they have a well-known Achilles’ heel: their appetite for computational resources. This becomes especially apparent in Retrieval-Augmented Generation (RAG) systems, where large amounts of external text are injected into the model to help it answer questions. The more context we provide, the better the potential answer—but the slower and more expensive the process becomes. This creates a frustrating trade-off between knowledge and efficiency. ...

2025-09 · 6 min · 1133 words