![Cover image](https://deep-paper.org/en/papers/2025-10/2407.08608/images/cover.png)
# Unpacking FlashAttention-3: How Asynchrony and FP8 Supercharge Transformers
The Transformer architecture is the powerhouse behind today’s AI revolution, but it has one stubborn bottleneck: the attention mechanism. As we push for larger models that can process entire books, massive codebases, or hours of video, the quadratic complexity of attention becomes a major computational obstacle. Simply put, doubling the input length roughly quadruples the cost of attention, so long contexts get expensive fast. This scaling problem has sparked intense innovation in making attention faster and more efficient.

A few years back, FlashAttention appeared as a breakthrough: by carefully managing memory I/O on GPUs, it delivered exact attention at high speed without resorting to approximations. Its successor, FlashAttention-2, improved parallelism and load balancing, but even so, on cutting-edge NVIDIA H100 GPUs it achieved only ~35% of the hardware’s theoretical maximum throughput. ...
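To make the quadratic cost concrete, here is a minimal PyTorch sketch of *standard* attention (not the FlashAttention kernel, and not code from the paper). The key point is the score matrix of shape `(seq_len, seq_len)` that it materializes: its size grows quadratically with sequence length, and avoiding writing that matrix to GPU memory is exactly what the FlashAttention family is about.

```python
import torch

def naive_attention(q, k, v):
    """Textbook scaled dot-product attention (illustrative only).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    # The score matrix has shape (batch, heads, seq_len, seq_len):
    # memory and compute grow with seq_len squared.
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = scores.softmax(dim=-1)
    return probs @ v

# Rough back-of-the-envelope: at seq_len = 64K, a single fp16 score
# matrix per head is 64K * 64K * 2 bytes ≈ 8 GiB, far larger than the
# on-chip SRAM a GPU streaming multiprocessor can hold.
```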