[WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization 🔗](https://arxiv.org/abs/2507.15061)

Beyond Guesswork: How WebShaper Engineers Smarter AI Web Agents with Mathematical Precision

Introduction: The Data Bottleneck for Web-Savvy AI. Large Language Model (LLM)-powered agents are rapidly evolving from simple chatbots into sophisticated digital assistants capable of tackling complex, open-ended tasks. Systems like OpenAI’s Deep Research, Google’s Gemini, and Perplexity AI can browse the web, gather information from multiple sources, and synthesize answers to questions that would have been impossible just a few years ago. This core capability is known as Information-Seeking (IS) — the engine driving the next generation of AI. ...

2025-07 · 6 min · 1228 words
[WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent 🔗](https://arxiv.org/abs/2508.05748)

WebWatcher: Training AI Agents to See, Read, and Reason Like a Pro Researcher

AI is getting incredibly good at research. Systems like OpenAI’s DeepResearch and Google’s Gemini can now tackle complex questions by searching the web, reading documents, and synthesizing information over multiple steps. These Deep Research agents are pushing the boundaries of what AI can do. But they have a huge blind spot: they are almost entirely text-based. In the real world—and especially on the web—information isn’t just text. It’s in charts, diagrams, product images, screenshots, and infographics. An agent that can’t see is missing half the story. The next great frontier for AI agents is combining vision and language to perform truly comprehensive research. ...

2025-08 · 5 min · 983 words
[CODE2VIDEO: A CODE-CENTRIC PARADIGM FOR EDUCATIONAL VIDEO GENERATION 🔗](https://arxiv.org/abs/2510.01174)

Forget Pixels, Let's Generate Code: A Deep Dive into Code2Video for Creating Educational Videos

We’ve all seen the incredible leaps in AI-powered video generation. Models like Sora can turn a simple text prompt into a stunning, photorealistic clip. But what happens when you need to create a video that doesn’t just look good, but actually teaches something? Think about the educational videos you see on YouTube channels like 3Blue1Brown—they are packed with precise animations, clear formulas, and a logical flow that guides you through complex topics. These videos don’t just entertain; they build and reinforce knowledge step-by-step. ...

2025-10 · 6 min · 1215 words
[Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents 🔗](https://arxiv.org/abs/2504.00906)

Agent S2: How a Team of AI Specialists is Mastering Your Computer

Imagine an AI assistant that could use your computer just like a human. It could book your travel, create a presentation from your notes, or manage your files — all by directly interacting with the graphical user interface (GUI): clicking icons, typing in text boxes, and dragging files. This is the promise of computer use agents — autonomous AI systems with the potential to automate countless digital tasks and dramatically boost productivity. ...

2025-04 · 6 min · 1200 words
[THE UNREASONABLE EFFECTIVENESS OF SCALING AGENTS FOR COMPUTER USE 🔗](https://arxiv.org/abs/2510.02250)

One Agent Is Good, Ten Are Better: How Scaling Unlocks Near-Human Performance in AI Computer Assistants

Artificial Intelligence is becoming remarkably adept at using computers. We now have AI systems capable of booking flights, managing spreadsheets, and editing photos by directly controlling a graphical user interface (GUI) — just like a human user. These Computer-Use Agents (CUAs) promise to automate countless tedious digital tasks. But there’s a catch: while they can be impressive, they often prove fragile. A single small mistake in a long series of actions — clicking the wrong button, misinterpreting a menu, or being thrown off by a pop-up — can derail the entire task. For complex, multi-step workflows, this unreliability is a major obstacle. The same agent might succeed flawlessly one time and fail spectacularly the next, resulting in frustratingly high variance that limits practical deployment. ...

2025-10 · 7 min · 1430 words
[EVOLUTION STRATEGIES AT SCALE: LLM FINE-TUNING BEYOND REINFORCEMENT LEARNING 🔗](https://arxiv.org/abs/2509.24372)

Evolution Strikes Back: A Surprisingly Powerful Way to Fine-Tune LLMs

Fine-tuning large language models (LLMs) is a critical step in making them useful for specific, real-world tasks. After a model is pre-trained on a vast corpus of text, fine-tuning adapts it to follow instructions, align with human preferences, or master specialized domains like coding, medicine, or scientific reasoning. For years, the undisputed champion of this process has been Reinforcement Learning (RL), particularly Reinforcement Learning from Human Feedback (RLHF), which powered landmark systems like ChatGPT. ...

2025-09 · 6 min · 1076 words
[It's Raw! Audio Generation with State-Space Models 🔗](https://arxiv.org/abs/2202.09729)

SASHIMI: Slicing Through Raw Audio with State-Space Models

Generating realistic, high-fidelity audio is one of the grand challenges in machine learning. Think about what a raw audio waveform is: a sequence of tens of thousands of numbers—or samples—for every second of sound. To produce even a few seconds of coherent music or speech, a model needs to understand intricate local patterns (like the texture of a piano note) while simultaneously maintaining global structure over hundreds of thousands of timesteps (like an evolving melody or a spoken sentence). ...
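
To make that scale concrete (a rough back-of-the-envelope using a common 16 kHz sampling rate, not a figure taken from the paper): one second of audio is already \(16{,}000\) samples, and a single minute stretches to

\[ 16{,}000 \ \tfrac{\text{samples}}{\text{second}} \times 60 \ \text{seconds} = 960{,}000 \ \text{timesteps}, \]

which is the kind of sequence length at which maintaining global structure becomes genuinely hard.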

2022-02 · 2 min · 410 words
[Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models 🔗](https://arxiv.org/abs/2208.09399)

Beyond the Gaps: A Deep Dive into SSSD for Time Series Imputation and Forecasting

Introduction: The Problem of Missing Time. Imagine you’re a doctor monitoring a patient’s heart with an ECG, but the sensor glitches and you lose a few critical seconds of data. Or perhaps you’re a financial analyst tracking stock prices and your data feed suddenly has gaps. Missing data is not just inconvenient—it’s a pervasive issue in real-world applications. It can derail machine learning models, introduce bias, and lead to flawed conclusions. ...

2022-08 · 6 min · 1223 words
[VideoMamba: State Space Model for Efficient Video Understanding 🔗](https://arxiv.org/abs/2403.06977)

Beyond Transformers: How VideoMamba Unlocks Efficient Long-Video Understanding

The world of video is exploding. From bite-sized clips on social media to full-length feature films, we are generating and consuming more video content than ever before. For AI, truly understanding this content is a monumental task. A single video can contain mountains of spatiotemporal information—ranging from subtle gestures to complex, multi-minute narratives. The core challenge for modern video understanding models comes down to two conflicting needs: efficiency, because video data is massive and often highly redundant, so models must process it quickly without exhausting computational resources; and global context, because videos aren’t just isolated frames, and understanding them requires capturing dependencies that can span hundreds or thousands of frames. The Historical Trade-Off. For years, two families of models have dominated: ...

2024-03 · 6 min · 1106 words
[Hungry Hungry Hippos: Towards Language Modeling with State Space Models 🔗](https://arxiv.org/abs/2212.14052)

Hungry Hippos on the Pile: A New Challenger to the Transformer Throne

For the past several years, the Transformer architecture has been the undisputed champion of language modeling. From GPT-3 to PaLM, massive Transformer models have redefined the state of the art. But this power comes at a cost: the attention mechanism—at the heart of the Transformer—scales quadratically with sequence length. Processing a sequence twice as long takes four times the computation and memory. This makes working with very long documents, codebases, or audio files a significant challenge. ...
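
A quick worked check of that claim (generic big-O arithmetic, not numbers from the paper): if attention cost grows roughly as \(L^2\) for context length \(L\), then doubling the context gives

\[ \frac{(2L)^2}{L^2} = 4, \]

i.e. four times the compute and memory for the attention maps, which is why very long inputs become painful so quickly.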

2022-12 · 7 min · 1460 words
[LocalMamba: Visual State Space Model with Windowed Selective Scan 🔗](https://arxiv.org/abs/2403.09338)

Beyond Transformers: How LocalMamba Unlocks the Power of State Space Models for Vision

For years, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs excel at capturing local features through sliding convolutional filters, while ViTs leverage self-attention to model global relationships across an entire image. Now, a new contender has emerged from the world of sequence modeling: the State Space Model (SSM), and in particular its modern, high-performing variant, Mamba. Mamba has shown remarkable prowess in handling long 1D sequences such as text and genomics, offering linear-time complexity and impressive performance. Naturally, researchers sought to bring its advantages to vision tasks. However, initial attempts such as Vision Mamba (Vim) and VMamba, while promising, have not decisively surpassed CNNs and ViTs. This raises a critical question: ...

2024-03 · 6 min · 1243 words
[Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers 🔗](https://arxiv.org/abs/2110.13985)

The Swiss Army Knife of Sequence Models: A Deep Dive into Linear State-Space Layers

Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers have revolutionized the way we process sequential data such as text, audio, and time series. Each paradigm is powerful, but each comes with its own limitations: RNNs are efficient at inference but train slowly on long sequences and suffer from vanishing gradients. CNNs train in parallel and are fast, but they struggle beyond their fixed receptive field and have costly inference. Transformers can capture global context but scale quadratically in memory and computation with sequence length. What if we could unify the strengths of these approaches? Imagine a model with: ...

2021-10 · 6 min · 1273 words
[On the Parameterization and Initialization of Diagonal State Space Models 🔗](https://arxiv.org/abs/2206.11893)

S4, But Simpler: How Diagonal State Space Models (S4D) Match Performance with Less Complexity

Introduction: The Quest for Efficient Sequence Models. Modeling long sequences of data—whether audio waveforms, medical signals, text, or flattened images—is a fundamental challenge in machine learning. For years, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) were the standard tools. More recently, Transformers have risen to prominence with remarkable results. But all of these models face trade-offs, particularly when sequences get very long. Enter State Space Models (SSMs). A recent architecture called S4 (Structured State Space for Sequences) emerged as a powerful contender, outperforming previous approaches for tasks requiring long-range memory. Built on a solid mathematical foundation from classical control theory, S4 efficiently models continuous signals with a special state matrix called the HiPPO matrix—a mathematical design aimed at remembering information over long periods. ...

2022-06 · 7 min · 1335 words
[Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model 🔗](https://arxiv.org/abs/2401.09417)

Vision Mamba: A New Challenger to Transformers for Computer Vision?

For the past few years, Vision Transformers (ViTs) have dominated computer vision. By treating images as sequences of patches and applying self-attention, these models have set new benchmarks in image classification, object detection, and semantic segmentation. However, this power comes at a steep computational cost. The self-attention mechanism at the core of Transformers suffers from quadratic complexity. In plain terms, if you double the number of image patches (for example, by increasing resolution), the computation and memory demands don’t just double—they quadruple. This makes high-resolution image processing slow, memory-hungry, and often impractical without specialized hardware or cumbersome architectural workarounds. ...

2024-01 · 6 min · 1191 words
[Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality 🔗](https://arxiv.org/abs/2405.21060)

Mamba‑2 Explained: The Duality Connecting State‑Space Models and Attention

Transformers dominate many sequence-modeling tasks, but their core self-attention scales quadratically with context length. That design choice makes very long contexts expensive in compute and memory. At the same time, structured state-space models (SSMs) — exemplified by S4 and Mamba — offer linear scaling in sequence length and a constant-size state during autoregressive generation. The two model families have matured along largely separate paths: different mathematics, different optimizations, and different engineering tradeoffs. ...
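
To see why the SSM side of that trade-off keeps generation cheap, here is a minimal NumPy sketch of a generic linear state-space recurrence \(h_t = A h_{t-1} + B x_t\), \(y_t = C h_t\). It illustrates only the constant-size-state idea, not the Mamba-2 architecture itself; the dimensions and random matrices below are arbitrary placeholders.

```python
import numpy as np

# Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# The point: per-step work and memory depend on the state size,
# not on how many tokens have been generated so far.
rng = np.random.default_rng(0)
d_state, d_in, d_out = 16, 4, 4                 # illustrative sizes only
A = 0.1 * rng.normal(size=(d_state, d_state))   # toy dynamics matrix
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))

def step(h, x):
    """One autoregressive step: update the fixed-size state, emit an output."""
    h = A @ h + B @ x
    return h, C @ h

h = np.zeros(d_state)            # the entire summary of everything seen so far
for t in range(10_000):          # the sequence keeps growing...
    x = rng.normal(size=d_in)
    h, y = step(h, x)            # ...but each step costs exactly the same
print(h.shape)                   # (16,): the state never grows
```

Contrast this with attention, where producing token \(t\) means attending over all \(t\) earlier tokens, so both per-step compute and the key-value cache grow with context length.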

2024-05 · 14 min · 2923 words
[VMamba: Visual State Space Model 🔗](https://arxiv.org/abs/2401.10166)

VMamba: A New Challenger to CNNs and Transformers in Computer Vision

For the past decade, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and, more recently, Vision Transformers (ViTs). CNNs are celebrated for their efficiency and strong inductive biases toward local patterns, while ViTs, powered by the self-attention mechanism, excel at capturing global relationships in images. However, this power comes at a cost — the self-attention mechanism has quadratic complexity (\(O(N^2)\)) with respect to the number of image patches, making ViTs computationally expensive and slow, especially for high-resolution images common in tasks like object detection and segmentation. ...

2024-01 · 6 min · 1149 words

From Atoms to Applications: Unpacking a Full-Featured 2D Flash Memory Chip

Introduction: The Nanoscale Revolution Waiting to Happen. For over a decade, two-dimensional (2D) materials like graphene and molybdenum disulfide (MoS₂) have been the superstars of materials science. Thinner than a single strand of human DNA, these atomic-scale sheets possess extraordinary electronic properties that promise to revolutionize computing — from ultra-fast transistors to hyper-efficient memory. They represent a potential path to continue the incredible progress of Moore’s Law, pushing beyond the physical limits of silicon. ...

8 min · 1651 words

The Power of Noise: How Denoising Autoencoders Learn Robust Features

Deep neural networks have become the cornerstone of modern artificial intelligence, achieving remarkable feats in areas like image recognition, natural language processing, and beyond. But before they became so dominant, there was a major hurdle: training them was incredibly difficult. The deeper the network, the harder it was to get it to learn anything useful. A key breakthrough came in the mid-2000s with the idea of unsupervised pre-training, a method of initializing a deep network layer by layer before fine-tuning it on a specific task. ...

6 min · 1195 words

Unlocking Deep Learning: How a 2006 Breakthrough Revolutionized Neural Networks

High-dimensional data—like images with millions of pixels, documents with thousands of words, or genomes with countless features—can be incredibly complex to understand and analyze. This is often referred to as the curse of dimensionality: with so many variables, it becomes harder to spot meaningful patterns and relationships, making tasks like classification, visualization, or storage challenging. For decades, the preferred technique to tackle this problem was Principal Component Analysis (PCA). PCA is a linear method that finds the directions of greatest variance in a dataset and projects it into a lower-dimensional space. It’s effective and simple, but inherently limited—especially when the patterns in the data are non-linear, curving through high-dimensional space in complex ways. In such cases, PCA can fail to capture important structure. ...
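
For a concrete sense of the linear baseline being criticized here, this is a generic NumPy sketch of PCA via the SVD (an illustration of standard PCA, not code from the paper; the data and dimensions are made up):

```python
import numpy as np

# Plain PCA: center the data, take the top-k right singular vectors,
# and project onto them. Note that both the encoding and the
# reconstruction are single linear maps.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # 500 samples, 50 features (toy data)

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                    # target dimensionality
Z = X_centered @ Vt[:k].T                # low-dimensional codes, shape (500, 2)
X_hat = Z @ Vt[:k] + X.mean(axis=0)      # purely linear reconstruction
print(Z.shape, X_hat.shape)
```

A deep autoencoder swaps those two matrix multiplications for non-linear encoder and decoder networks, which is what allows it to follow curved, non-linear structure that this flat projection cannot.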

6 min · 1238 words
[NAS-Bench-1Shot1: Benchmarking and Dissecting One-shot Neural Architecture Search 🔗](https://arxiv.org/abs/2001.10422)

Cracking the Code of One-Shot NAS: A Deep Dive into the NAS-Bench-1Shot1 Benchmark

Introduction: The Promise and Peril of Automated AI. Neural Architecture Search (NAS) is one of the most exciting frontiers in machine learning. Imagine an algorithm that can automatically design the perfect neural network for your specific task, potentially outperforming architectures crafted by world-class human experts. This is the promise of NAS. Early successes proved that NAS could discover state-of-the-art models for image classification and other tasks — but at a staggering cost. The search often required thousands of GPU-days of computation, making it a luxury only accessible to a few large tech companies. ...

2020-01 · 7 min · 1433 words