![Cover image](https://deep-paper.org/en/paper/file-1898/images/cover.png)
BLIP: Bootstrapping Better Image–Text Models with Captioning + Filtering for Unified Vision-Language AI
Introduction
Vision-language models, capable of both understanding images (e.g., answering questions, retrieving matching captions) and generating language about them (e.g., writing descriptive captions), have seen remarkable progress recently. However, most existing pre-trained models excel at either “understanding” (encoder-based tasks like retrieval) or “generation” (decoder-based tasks like captioning), but rarely both. Furthermore, a significant portion of performance gains in this field has come from simply scaling up training datasets with noisy image-text pairs collected from the web, a convenient yet suboptimal source of supervision. ...
![Cover image](https://deep-paper.org/en/paper/2304.02643/images/cover.png)
Segment Anything: Building a Foundation Model for Image Segmentation
Segmentation, the task of delineating precise object boundaries in images, is fundamental to countless applications, from photo editing and robotics to medical imaging and autonomous vehicles. Despite its ubiquity, segmentation has long awaited a “foundation model” — a single, broadly applicable, and promptable model that can generalize across diverse tasks and domains, much like large language models have transformed NLP. The Segment Anything (SA) project boldly steps in to fill this gap. It introduces a novel task, a new model architecture, and an innovative data engine that collectively yield the Segment Anything Model (SAM) and the largest segmentation dataset ever: SA-1B, featuring 11 million images and an astounding 1.1 billion masks. ...
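To illustrate what “promptable” means in practice, here is a minimal sketch using the released `segment_anything` package; the checkpoint path, image file, and click coordinates are placeholders rather than values from the article:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path is a placeholder; weights are downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Embed the image once, then prompt it as many times as you like.
image = np.array(Image.open("street.jpg").convert("RGB"))  # placeholder image
predictor.set_image(image)

# A single foreground click (x, y) is one kind of prompt; boxes and masks also work.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),   # 1 = foreground point
    multimask_output=True,        # several candidate masks for an ambiguous prompt
)
print(masks.shape, scores)        # (3, H, W) boolean masks with predicted quality scores
```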
![Cover image](https://deep-paper.org/en/paper/2103.00020/images/cover.png)
CLIP Explained: Teaching Vision Models with Language (and Why It Works)
Imagine building an image classifier that doesn’t need to be retrained every time you want to recognize a new set of categories. Instead of collecting thousands of labeled photos for specific categories like “Bernese mountain dog” or “stop sign,” you simply tell the model, “A photo of a {label},” and it understands. That’s the promise of CLIP (Contrastive Language–Image Pre-training), a simple but powerful idea from OpenAI: learn a joint image–text embedding by training on (image, caption) pairs scraped from the web, and then use natural language as the interface for zero-shot classification. ...
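To make the zero-shot recipe concrete, here is a minimal sketch using OpenAI’s open-source `clip` package; the model name, image path, and label set are illustrative, not taken from the article:

```python
import torch
import clip  # OpenAI's reference implementation: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Natural language is the classifier: one prompt per candidate label.
labels = ["Bernese mountain dog", "stop sign", "tabby cat"]
text = clip.tokenize([f"A photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity in the joint embedding space, turned into per-label probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

Swapping in a new set of categories only requires changing the `labels` list; no retraining is involved.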
![Cover image](https://deep-paper.org/en/paper/2105.07581/images/cover.png)
Why Vision Transformers Are Surprisingly Robust: Insights from 'Vision Transformers are Robust Learners'
Introduction
Deep learning for vision has been dominated by convolutional neural networks (CNNs) for years. Recently, however, Vision Transformers (ViTs) — models built from self-attention blocks originally popularized in Natural Language Processing (NLP) — have surged to the front of the field. They match or surpass CNNs on many standard benchmarks, but accuracy on clean test sets is only part of the story. If a vision model is to be deployed in the real world, its robustness is paramount: how well does the model handle common corruptions, small perturbations, distribution shifts, or naturally adversarial images? ...
![Cover image](https://deep-paper.org/en/paper/2101.11986/images/cover.png)
From Pixels to Tokens: How T2T‑ViT Makes Transformers Work on ImageNet
Introduction
Transformers have revolutionized language processing, but adapting their success to computer vision tasks has not been straightforward. The Vision Transformer (ViT) demonstrated that it’s possible to treat an image as a sequence of tokens and apply pure Transformer layers for classification. However, ViT often requires massive pretraining datasets (like JFT-300M) to achieve accuracy comparable to well-tuned Convolutional Neural Networks (CNNs) on a mid-size dataset such as ImageNet. The authors of “Tokens-to-Token ViT” diagnose two primary issues with the vanilla ViT when trained from scratch: ...
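For readers new to ViT, the tokenization step being critiqued is simply this: cut the image into fixed patches and linearly project each one. Below is a minimal PyTorch sketch of that vanilla patch embedding (hyperparameters are the standard ViT-Base choices, shown only for illustration; this is not the T2T module itself):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Vanilla ViT tokenization: split the image into non-overlapping patches
    and project each patch to an embedding ("token")."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying one linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)   # (B, 196, 768): a sequence of patch tokens

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```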
![Cover image](https://deep-paper.org/en/paper/2209.03430/images/cover.png)
A Grand Tour of Multimodal AI: The Six Core Challenges Shaping the Future
Humans perceive the world through a blend of senses: sight, sound, touch, language, and more. Modern AI is trying to do the same. Multimodal machine learning studies how to combine different kinds of signals—images, audio, text, sensor data—so a system can reason about the world in a richer, more human-like way. A sweeping, synthesizing paper from researchers at Carnegie Mellon—“Foundations & Trends in Multimodal Machine Learning”—lays out a clear, principled roadmap for this field. It distills the field into three foundational principles and a taxonomy of six core technical challenges. This article is a guided tour of that roadmap: we’ll unpack the principles, walk through the six challenges (and their subproblems), show how they connect, and highlight the open questions likely to shape the next decade of multimodal research. ...
![Cover image](https://deep-paper.org/en/paper/2110.07205/images/cover.png)
SpeechT5: One Model to Rule All Speech and Text Tasks
In artificial intelligence, models often specialize narrowly—one converts speech to text, another turns text into speech, and yet another translates spoken words from one language to another. Each model excels at its domain yet remains confined within its own modality. But what if we could collapse those boundaries and build a single versatile model capable of handling all spoken language tasks? That is the audacious goal behind SpeechT5, a research project from Microsoft introducing a unified pre-training framework for spoken language processing. Inspired by Google’s T5 (Text-to-Text Transfer Transformer), which treats every natural language processing (NLP) problem as “text-to-text,” the SpeechT5 team posed an equally bold question: Could we treat all speech tasks as “speech/text-to-speech/text”? ...
![Cover image](https://deep-paper.org/en/paper/2212.04356/images/cover.png)
Whisper: Inside OpenAI’s Quest for Human-Level Speech Recognition
Automatic Speech Recognition (ASR) has reached extraordinary heights. Modern systems can transcribe clear, read-aloud speech with astonishing accuracy, sometimes even surpassing human performance on academic benchmarks. However, this achievement hides a critical weakness. When exposed to everyday audio—accented speech, background noise, colloquial phrasing—these systems’ accuracy often collapses. They are overfit to pristine lab conditions, lacking the flexibility and robustness that define a human listener. This gap between benchmark perfection and real-world reliability is the problem OpenAI’s Whisper project sets out to solve. In their paper “Robust Speech Recognition via Large-Scale Weak Supervision,” the authors propose a radical departure from typical speech recognition research. Instead of refining narrow models for specific datasets, they trained a single system on an enormous and messy corpus—680,000 hours of audio from across the internet. These weren’t meticulously annotated recordings; they were audio clips paired with transcripts of varying quality, collected from diverse conditions and speakers. Yet, the resulting model demonstrates extraordinary generalization: multitask, multilingual, and surprisingly close to human-level robustness, all achieved without fine-tuning on target benchmarks. ...
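As a concrete illustration of that multitask, multilingual interface, here is a minimal sketch using the open-sourced `whisper` package; the model size and audio file are placeholders:

```python
import whisper

model = whisper.load_model("small")  # placeholder size; larger checkpoints are more robust

# Transcription: the spoken language is detected automatically unless specified.
result = model.transcribe("interview.mp3")          # placeholder audio file
print(result["language"], result["text"])

# Same model, same weights, different task: translate foreign speech into English text.
translated = model.transcribe("interview.mp3", task="translate")
print(translated["text"])
```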
![Cover image](https://deep-paper.org/en/paper/2104.01778/images/cover.png)
AST: How Vision Transformers Learned to Hear
For almost a decade, if you wanted to build a cutting-edge audio classification system, your go-to architecture was the Convolutional Neural Network (CNN). From identifying birds by their calls to recognizing spoken words, CNNs have long dominated the field. Their ability to detect local structures in audio spectrograms—the visual representations of sound—made them a natural fit. The reasoning was intuitive: just as CNNs find edges and textures in images, they can find formants, onsets, and harmonic patterns in spectrograms. To handle longer-range temporal context, researchers began stacking self-attention mechanisms or Transformers on top of CNN backbones. The resulting CNN-attention hybrids consistently advanced the state of the art. ...
![Cover image](https://deep-paper.org/en/paper/2301.12597/images/cover.png)
BLIP-2: How to Teach Giant Language Models to See—Efficiently
The past few years have marked an incredible leap in artificial intelligence. Large Language Models (LLMs) like GPT-4, LLaMA, and FlanT5 have shown that machines can generate poetry, write essays, debug code, and hold remarkably coherent conversations. Yet, despite their linguistic mastery, they have a glaring limitation: they cannot see. These models inhabit a world of pure text, blind to the visual richness of our environment. Teaching machines to understand both images and text—the field of Vision-Language Pre-training (VLP)—is a key frontier in AI research. Traditionally, this has required assembling gigantic, end-to-end models and training them from scratch on billions of image-text pairs. Such methods achieve impressive results but demand colossal computing resources and long training times, often beyond the reach of all but the largest tech companies. ...
![Cover image](https://deep-paper.org/en/paper/2501.12948/images/cover.png)
Beyond Memorization: How DeepSeek-R1 Teaches LLMs to Truly Reason
Large Language Models (LLMs) have become incredibly proficient at tasks like writing emails, summarizing articles, and even generating code. But there’s a crucial difference between fluent text generation and genuine, multi-step reasoning. Can a model solve an unfamiliar competition-level math problem? Can it debug a complex algorithm? These are the frontiers of AI—where models must transition from pattern matching to true problem-solving. Traditionally, we teach models via Supervised Fine-Tuning (SFT)—showing them thousands of high-quality examples and telling them, “do this.” But what if a model could teach itself to reason? What if, instead of being fed solutions, it learned how to find them through trial and error? ...
![Cover image](https://deep-paper.org/en/paper/2304.03843/images/cover.png)
Why Chain of Thought Works: It’s All About Local Experience
Introduction: The Mysterious Power of Thinking
Humans have a remarkable ability. When faced with a problem too complex to solve in a single leap—like a tricky math problem, planning a multi-stop vacation, or even understanding a dense research paper—we can break it down. We think through it, step by step, chaining together smaller, manageable inferences until we arrive at a solution. This process of reasoning feels so natural that we rarely stop to ask a fundamental question: why does it even work? After all, thinking doesn’t give us new data from the world—it just reorganizes what we already know. ...
![Cover image](https://deep-paper.org/en/paper/2005.11401/images/cover.png)
RAG: How to Give Your LLM an Open-Book Exam
Large Language Models (LLMs) like GPT-4 are remarkable. They can write code, craft poems, summarize articles, and explain complex topics with startling fluency. This power stems from parametric memory—the vast amount of knowledge they absorb during training, stored in billions of neural parameters. But this parametric memory comes with a few well-known limitations:
- Static knowledge: Once the model is trained, its world knowledge is frozen. An LLM from 2022 can’t tell you who won the 2023 World Series.
- Hallucinations: It may generate confident but false statements—plausible-sounding fabrications of facts.
- Opacity: It’s difficult to inspect how or why the model arrived at an answer, or to correct a single mistaken fact without retraining.
So, what if we could give LLMs a cheat sheet? Let them take an open-book exam—consulting trusted, up-to-date sources as they generate answers. ...
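Here is a deliberately tiny sketch of the retrieve-then-generate loop behind the “open-book exam” metaphor. The corpus, the bag-of-words retriever, and the prompt template are illustrative stand-ins; a real system would use a learned dense retriever and an actual LLM call:

```python
import numpy as np

# Toy corpus standing in for a trusted, up-to-date knowledge source.
documents = [
    "The 2023 World Series was won by the Texas Rangers.",
    "RAG combines a retriever with a seq2seq generator.",
    "Parametric memory is the knowledge stored in a model's weights.",
]

def bow_vector(text, vocab):
    """Crude bag-of-words embedding, standing in for a learned dense retriever."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

vocab = sorted({w for d in documents for w in d.lower().split()})
doc_vecs = np.stack([bow_vector(d, vocab) for d in documents])

def build_rag_prompt(question, k=2):
    # 1. Retrieve: score each document against the question, keep the top-k.
    q = bow_vector(question, vocab)
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    context = "\n".join(documents[i] for i in top)
    # 2. Generate: hand the retrieved "cheat sheet" plus the question to an LLM
    #    (the LLM call itself is omitted; any generator could be plugged in here).
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("Who won the 2023 World Series?"))
```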
![Cover image](https://deep-paper.org/en/paper/2510.18118/images/cover.png)
When Straight Lines Fail: How Gradient Variance Unmasks Memorization in Rectified Flows
Introduction
Generative modeling has seen incredible advancements, with score-based models and neural Ordinary Differential Equation (ODE) flows learning to transform simple noise into sharp images, audio, or complex molecular structures. A particularly appealing concept in this field is that of “straight flows”—learning a vector field whose trajectories between a known source distribution (like a standard Gaussian) and a target data distribution are nearly straight lines. This straightness allows for rapid, often one-step, generation from noise to data. Rectified Flows (ReFlow) are designed precisely to achieve this, aiming to iteratively straighten transport paths to enable faster and more efficient sampling. ...
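To ground the idea, here is a minimal sketch of the basic rectified-flow (flow-matching) training objective on toy 2-D data: regress a velocity field onto the constant straight-line displacement between a noise sample and a data sample. The tiny MLP and the blob dataset are illustrative, and the iterative “reflow” straightening step is omitted:

```python
import torch
import torch.nn as nn

# Velocity network v(x, t): input is the 2-D point plus the time, output is a 2-D velocity.
velocity = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-3)

def sample_data(n):
    # Stand-in target distribution: two Gaussian blobs.
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(2, (n,))] + 0.1 * torch.randn(n, 2)

for step in range(1000):
    x0 = torch.randn(256, 2)                 # source: standard Gaussian noise
    x1 = sample_data(256)                    # target: data samples
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1               # point on the straight line from x0 to x1
    pred = velocity(torch.cat([xt, t], dim=1))
    loss = ((pred - (x1 - x0)) ** 2).mean()  # match the straight-line velocity x1 - x0
    opt.zero_grad()
    loss.backward()
    opt.step()

# When the learned flow is nearly straight, a single coarse Euler step from noise
# already lands close to the data distribution.
x = torch.randn(5, 2)
x_gen = x + velocity(torch.cat([x, torch.zeros(5, 1)], dim=1))
print(x_gen)
```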
![Cover image](https://deep-paper.org/en/paper/2510.17545/images/cover.png)
TrajMamba Explained: Fast, Purpose-aware Embeddings for Vehicle Trajectories
Introduction
Vehicle GPS traces are ubiquitous, found in taxis, delivery fleets, and ride-hailing logs. Each trajectory tells a rich story of movement—where a vehicle started, how it drove (straight, turned, stopped), and its ultimate destination. Extracting this narrative as a compact vector, known as an embedding, is incredibly valuable for a variety of intelligent transportation system (ITS) applications, including trajectory prediction, efficient routing, anomaly detection, and more. However, two significant practical hurdles complicate this process. First, the “purpose” behind a trip (e.g., commuting, shopping, business) is often encoded in textual metadata, such as road names and Points of Interest (POI) descriptions. Integrating language models (LMs) to capture these semantic nuances in trajectory encoders can introduce substantial computational overhead. Second, real-world trajectories are frequently noisy and redundant. High-frequency sampling often includes many uninformative points, like those recorded during traffic stops or steady-speed cruising, which bloat computation and can degrade the quality of trajectory representations. ...
![Cover image](https://deep-paper.org/en/paper/2505.17677/images/cover.png)
Seeing Surgery in 3D: How OphNet-3D Reconstructs Hands and Tools for Microsurgical Analysis
Introduction
Imagine trying to teach or evaluate the incredibly fine motor skills required for ophthalmic microsurgery, like cataract removal, using only a flat 2D video. Subtle nuances in a surgeon’s wrist orientation, finger placement, or the precise angle of a surgical tool—often on a sub-millimeter scale—are critical for successful outcomes. Yet, current skill assessment often relies on subjective expert supervision, which isn’t scalable or objective enough for modern training needs. ...
![Cover image](https://deep-paper.org/en/paper/2501.01774/images/cover.png)
From TD to FQI: How Preconditioning Unifies Off-Policy Linear Value Estimation
Introduction
Reinforcement learning thrives on bootstrapping: updating an estimate using other estimates. But combine bootstrapping with function approximation and off-policy data, and instability often appears—a phenomenon known as the “deadly triad.” Two widely used value-estimation approaches sit on opposite ends of an apparent stability spectrum: Temporal-Difference learning (TD) is simple and incremental but can diverge off-policy; Fitted Q-Iteration (FQI) is often observed to be more stable in practice, especially in batch settings. ...
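To make the contrast concrete, here is a toy sketch of linear value estimation with both update styles. The random features and rewards are purely illustrative (real off-policy experiments would use logged transitions), and the batch step is an FQI-style fitted value iteration on state features, kept to the value case for brevity:

```python
import numpy as np

# Toy off-policy data: features phi(s), rewards r, next-state features phi(s').
rng = np.random.default_rng(0)
n, d, gamma = 500, 4, 0.9
Phi = rng.normal(size=(n, d))          # phi(s_i)
Phi_next = rng.normal(size=(n, d))     # phi(s'_i), collected under a behaviour policy
r = rng.normal(size=n)

# --- TD(0): incremental semi-gradient updates (simple, but can diverge off-policy) ---
w_td = np.zeros(d)
for i in range(n):
    td_error = r[i] + gamma * Phi_next[i] @ w_td - Phi[i] @ w_td
    w_td += 0.01 * td_error * Phi[i]   # nudge weights to shrink the TD error

# --- FQI-style batch iteration: regress onto frozen bootstrapped targets, repeat ---
w_fqi = np.zeros(d)
for _ in range(100):
    targets = r + gamma * Phi_next @ w_fqi                   # build targets from the frozen estimate
    w_fqi, *_ = np.linalg.lstsq(Phi, targets, rcond=None)    # least-squares projection step

print("TD weights: ", w_td)
print("FQI weights:", w_fqi)
```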
![Cover image](https://deep-paper.org/en/paper/2510.21160/images/cover.png)
Teaching Machines to 'See' Space: Grid-Based Spatial Intelligence for Autonomous Driving
Introduction: Beyond Linguistic Shortcuts in AI Spatial Reasoning
Imagine asking an autonomous vehicle: “Which car is behind the black truck, and how far is it?” For humans, answering this question involves an intuitive understanding of spatial relationships, built on an internal mental map and selective attention. We inherently grasp concepts like “left of,” “in front of,” and varying distances, combining them with what we visually focus on. ...
![Cover image](https://deep-paper.org/en/paper/2406.16424/images/cover.png)
MEMENTO: Teaching Neural Solvers to Remember — Faster, Smarter Routing with Memory-Augmented Inference
Introduction
Routing problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), are fundamental to countless real-world applications. From optimizing delivery routes and scheduling maintenance crews to intricate chip manufacturing, these problems demand efficient solutions. However, their NP-hard nature means that finding exact solutions becomes computationally intractable as problem sizes grow. Consequently, industrial applications heavily rely on sophisticated heuristics and search algorithms. In recent years, reinforcement learning (RL) has emerged as a flexible and powerful framework for learning such heuristics. Yet, a common challenge for learned policies is their inability to fully leverage the additional computational budget often available during inference time – when multiple attempts can be made to solve a single problem instance. Existing strategies typically involve stochastic sampling from a fixed policy, per-instance policy-gradient fine-tuning, or searching over a collection of pre-trained policies. Each of these approaches comes with its own trade-offs in terms of adaptability, speed, and data efficiency. ...
![Cover image](https://deep-paper.org/en/paper/2502.07861/images/cover.png)
Balancing the Past: How Discrepancy Theory Compresses the KV Cache for Long-Context Transformers
Introduction
Transformer decoders, the powerhouses behind modern Large Language Models (LLMs), operate by generating tokens one by one. To maintain context and avoid redundant computations, they store a growing cache of previously generated key (\(K\)) and value (\(V\)) embeddings. This “KV cache” is a critical component, but it also represents the primary memory bottleneck, especially when these models handle increasingly long contexts. Each new token adds a \(d\)-dimensional key and a \(d\)-dimensional value per attention head and layer, leading to memory requirements that scale linearly with context length. ...
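That linear growth is easy to quantify with a back-of-the-envelope calculation. The model shape below is illustrative (roughly a 32-layer, 7B-parameter-class decoder caching keys and values in fp16), not any specific model from the article:

```python
# Estimate KV-cache memory: per token, each layer and head stores one key vector and
# one value vector of size head_dim, at bytes_per_elem bytes per number (2 for fp16).
def kv_cache_bytes(context_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    return context_len * n_layers * n_heads * head_dim * 2 * bytes_per_elem

for L in (2_048, 32_768, 128_000):
    print(f"{L:>7} tokens -> {kv_cache_bytes(L) / 2**30:.1f} GiB")
```

Running this shows the cache going from about 1 GiB at a 2K context to tens of GiB at 128K, which is exactly the bottleneck that cache-compression methods target.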