![FlashAttention-2 paper cover](https://deep-paper.org/en/paper/2307.08691/images/cover.png)
FlashAttention-2: Even Faster, Even More Efficient Attention for Transformers
If you’ve been following the world of large language models, you know that one of the biggest goals is expanding the context window. We want models that can read entire books, analyze lengthy codebases, or process high-resolution images. The main obstacle? The attention mechanism at the heart of the Transformer architecture. Its computational and memory costs grow quadratically with the sequence length, making long contexts prohibitively expensive.

A breakthrough paper in 2022, FlashAttention, tackled this problem head-on. By cleverly reordering the attention computation to be more aware of the GPU’s memory hierarchy, it achieved linear memory usage and a 2–4× speedup over standard implementations—all without any approximation. It was a game-changer and has been widely adopted. ...
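To make the quadratic-memory point concrete, here is a minimal sketch of *standard* attention in PyTorch (not the FlashAttention kernel; the function name `naive_attention` and the shapes are illustrative assumptions). It materializes the full N×N score matrix, which is exactly the buffer that FlashAttention avoids writing to GPU memory by computing attention in tiles.

```python
import torch

def naive_attention(q, k, v):
    """Standard attention sketch: materializes the full (N x N) score matrix.

    q, k, v: (batch, heads, N, d) tensors. The `scores` buffer grows as
    O(N^2) per head, which is what makes long contexts expensive.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (batch, heads, N, N)
    probs = torch.softmax(scores, dim=-1)      # another (N x N) buffer
    return probs @ v                           # (batch, heads, N, d)

# Rough scale: at N = 32768 tokens, a single head's fp16 score matrix is
# 32768 * 32768 * 2 bytes ~= 2 GiB -- before multiplying by batch and heads.
q = k = v = torch.randn(1, 1, 1024, 64)
out = naive_attention(q, k, v)
```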