![Cover image](https://deep-paper.org/en/papers/2025-10/2509.20357/images/cover.png)
# Beyond Math Puzzles: How Teaching LLMs to 'Think' Unlocks Superior Chat Performance
## Introduction: The Power of Thinking Before You Speak

We’ve all heard the advice, “think before you speak.” It’s a core aspect of human intelligence—the ability to pause, reason through the consequences, and formulate a thoughtful response. Nobel laureate Daniel Kahneman described this reflective, deliberate process as System 2 thinking: the kind of mental effort that distinguishes a knee-jerk reaction from a reasoned argument.

For much of their existence, Large Language Models (LLMs) have operated more like System 1 thinkers: remarkably fast, impressively fluent, but too often shallow in reasoning. Recent research has sought to change that by teaching models to “think” before answering, using a strategy called Reinforcement Learning with Verifiable Rewards (RLVR).

In RLVR, a model generates a long chain of thought (CoT) before producing its answer, and earns a reward when the final answer can be automatically verified as correct. This works extremely well in math and code—where correctness is objective. If the math checks out or the code passes all the unit tests, the model gets rewarded. ...
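To make the RLVR setup concrete, here is a minimal sketch of a verifiable reward function. It assumes a hypothetical convention in which the model ends its chain of thought with a line of the form `Answer: <value>`; the function name and answer format are illustrative, not from the paper.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the
    ground truth, else 0.0. Assumes (hypothetically) that the model
    marks its final answer with a line 'Answer: <value>' after its
    chain of thought."""
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

# A long chain of thought followed by the final, checkable answer.
output = "First, 12 * 7 = 84. Then 84 + 16 = 100.\nAnswer: 100"
print(verifiable_reward(output, "100"))  # -> 1.0
```

Because the reward depends only on the verifiable final answer, the intermediate chain of thought is free-form: the model is rewarded for reasoning that leads somewhere correct, not for any particular reasoning style.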