[Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling 🔗](https://arxiv.org/abs/1412.3555)

LSTM vs. GRU: The Battle of Gated RNNs

From the melodies we listen to and the sentences we read, to the raw waveforms of our speech, the world around us is filled with sequences. For machine learning, understanding and generating this kind of data is a monumental challenge. How can a model grasp the grammatical structure of a long paragraph, or compose a melody that feels coherent from start to finish? The key lies in memory — specifically, the ability to store information over long spans of time. ...

2014-12 · 7 min · 1352 words
[Rethinking the Inception Architecture for Computer Vision 🔗](https://arxiv.org/abs/1512.00567)

Smarter, Not Harder: How Google's Inception V2 and V3 Rethought Deep Learning Architecture

In the world of deep learning, there was once a powerful and seductive mantra: “just add more layers.” For a time, this seemed to be the primary path to success. AlexNet gave way to the much deeper VGGNet, and with each added layer, performance on benchmarks like ImageNet climbed higher. But this progress came at a steep price—astronomical computational costs and ballooning parameter counts. Training these behemoths required massive GPU clusters, and deploying them on resource-constrained devices like smartphones was nearly impossible. ...

2015-12 · 6 min · 1241 words
[SEARCHING FOR ACTIVATION FUNCTIONS 🔗](https://arxiv.org/abs/1710.05941)

Beyond ReLU: How Automated Search Discovered the Swish Activation Function

For nearly a decade, the Rectified Linear Unit (ReLU) has been the undisputed champion of activation functions in deep learning. Its elegant simplicity—outputting the input if it’s positive and zero otherwise—was a breakthrough that unlocked practical training for very deep neural networks. Fast, effective, and easy to implement, ReLU quickly became the default choice across the AI community. Many rivals have tried to dethrone ReLU. Alternatives like Leaky ReLU, ELU, and SELU promised improvements by tweaking how ReLU handles negative inputs. Yet none managed to replace it. Gains were often inconsistent across models and datasets, leaving practitioners to fall back on ReLU’s reliable simplicity. ...
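
For a concrete feel of the two functions being compared, here is a minimal NumPy sketch (illustrative only, not code from the paper; β is shown fixed at 1, though the paper also treats it as trainable):

```python
import numpy as np

def relu(x):
    # ReLU: pass positive inputs through unchanged, zero out the rest.
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta can be a constant or a trainable parameter.
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3, 3, 7)
print(relu(x))   # [0. 0. 0. 0. 1. 2. 3.]
print(swish(x))  # smooth and non-monotonic: small negative inputs are not zeroed out
```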

2017-10 · 6 min · 1249 words
[Representation Learning: A Review and New Perspectives 🔗](https://arxiv.org/abs/1206.5538)

From Pixels to Concepts: The Power of Representation Learning

If you’ve ever trained a model, you know the grind: collect data, clean it, and then spend weeks engineering features that coax performance out of your algorithm. That manual feature engineering is often the make-or-break step—time-consuming, brittle, and domain-specific. Representation learning aims to change that. Instead of relying on human intuition to hand-craft features, we want models that discover the right internal descriptions automatically—representations that reveal the underlying explanatory factors of the data. ...

2012-06 · 13 min · 2594 words

Looking Both Ways: How Bidirectional LSTMs Revolutionized Sequence Processing

Imagine listening to a friend speak. How does your brain make sense of the continuous stream of sounds? You don’t just process each sound in isolation — your understanding of a word often depends on what was said before and what will be said after. Consider the phrase: “I read the book.” Did you pronounce “read” as “reed” or “red”? You can’t know without the full context. This ability to use both past and future information is fundamental to how we understand sequences—whether it’s speech, text, or music. ...

6 min · 1201 words
[xLSTM: Extended Long Short-Term Memory 🔗](https://arxiv.org/abs/2405.04517)

The Return of the RNN? A Deep Dive into xLSTM

For most of the past decade, Transformers have defined the frontier of sequence modeling. Their ability to process long contexts in parallel unlocked the era of large language models (LLMs). But this progress also shifted attention away from the original sequential engines: recurrent neural networks and, in particular, LSTMs — the architecture Sepp Hochreiter helped invent. xLSTM: Extended Long Short-Term Memory revisits that lineage and asks a deceptively simple question: if we scale LSTMs using modern engineering and remove their known weaknesses, how far can they go? The short answer: a long way. The paper introduces a family of LSTM extensions that enable decisive memory updates, massively expand storage capacity, and—crucially—make parts of the architecture parallelizable and competitive with modern alternatives. On several language modeling benchmarks and synthetic tests, the resulting xLSTM models match or exceed state-of-the-art Transformers and State Space Models. ...

2024-05 · 14 min · 2905 words

Unpacking the RNN Encoder–Decoder: The Paper That Taught Machines to Translate

Machine translation is one of those problems that seems deceptively simple at first glance. Can’t we just swap words from one language for another? Anyone who has tried this, or used an early translation tool, knows the comical and often nonsensical results. The sentence “The cat sat on the mat” isn’t just a collection of words; it’s a structure with grammatical rules and a specific meaning. True translation requires understanding the entire thought before expressing it in another language. ...

7 min · 1376 words

The Unreasonable Effectiveness of LSTMs: A Deep Dive into the 1997 Paper that Changed AI

It’s 1997. The Spice Girls are topping the charts, Titanic is about to hit theaters, and two researchers, Sepp Hochreiter and Jürgen Schmidhuber, publish a paper that will, in time, become a cornerstone of the modern AI revolution. The paper, titled Long Short-Term Memory, proposed a new kind of neural network architecture that could remember information for incredibly long periods. At the time, this was a monumental challenge. Neural networks had a memory problem — they were notoriously forgetful. Trying to get a standard Recurrent Neural Network (RNN) to remember something that happened 100 steps ago was like trying to recall the first sentence of a book after reading the whole thing. The information would almost certainly be gone, washed away by a flood of newer data. ...

7 min · 1323 words
[FLIPPING THE DIALOGUE: TRAINING AND EVALUATING USER LANGUAGE MODELS 🔗](https://arxiv.org/abs/2510.06552)

Why AI Assistants Make Terrible Simulated Users — And How 'Flipping the Dialogue' Fixes It

You’ve probably chatted with an AI assistant like ChatGPT, Claude, or Llama. You type a question, and it fires back with a polished, well-structured answer — articulate, exhaustive, and unfailingly polite. These models are trained to be ideal conversational partners. But here’s the catch: real human users aren’t like that. Our requests in the wild are messy. We make typos, use slang, change our minds mid-conversation, and rarely lay out our entire request in perfect order. For example: ...

2025-10 · 7 min · 1332 words
[SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning 🔗](https://arxiv.org/abs/2111.13196)

Teaching Machines to Describe Videos: A Deep Dive into SWINBERT

Have you ever wondered how platforms like YouTube can automatically generate captions for videos? This task—known as video captioning—is a fascinating challenge at the intersection of computer vision and natural language processing. It requires a machine to not only see what’s happening in a video, but also to understand sequences of actions and describe them in clear, coherent, human-like language. For years, the standard approach has resembled a factory production line: ...

2021-11 · 6 min · 1103 words
[Efficient Content-Based Sparse Attention with Routing Transformers 🔗](https://arxiv.org/abs/2003.05997)

Taming the Quadratic Beast — How Routing Transformers Scale to Massive Sequences

The Transformer architecture, with its powerful self-attention mechanism, has revolutionized machine learning. From generating human-like text with GPT models to creating stunning images, its impact is undeniable. At its heart, self-attention allows a model to weigh the importance of every single piece of input when processing any other piece. This gives it a comprehensive, global understanding of the data. But this power comes at a steep price: the computational and memory costs of self-attention grow quadratically with sequence length — \(O(n^2)\). This means that doubling the sequence length quadruples the cost. For sequences of a few thousand tokens, this is manageable. But what about modeling an entire book, a high-resolution image, or a full-length symphony? The quadratic scaling quickly becomes a prohibitive bottleneck, making it incredibly difficult to apply Transformers to truly long sequences. ...
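
To make that quadratic growth concrete, here is a tiny NumPy sketch (purely illustrative; the head dimension of 64, the sequence lengths, and the helper name are arbitrary choices, not values from the paper) that materializes the dense score matrix vanilla attention requires:

```python
import numpy as np

def full_attention_scores(n, d=64, seed=0):
    # Dense self-attention scores: every token attends to every other token,
    # so the matrix has n * n entries.
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    return (q @ k.T) / np.sqrt(d)  # shape (n, n)

for n in (1024, 2048, 4096):
    scores = full_attention_scores(n)
    print(f"n={n:5d}  scores: {scores.shape}  {scores.nbytes / 2**20:7.1f} MiB")
# Each doubling of n roughly quadruples both the multiply-adds and the memory
# for the score matrix -- the O(n^2) cost the Routing Transformer sidesteps
# by routing each query to a small cluster of relevant keys.
```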

2020-03 · 7 min · 1342 words
[Efficient Non-Local Contrastive Attention for Image Super-Resolution 🔗](https://arxiv.org/abs/2201.03794)

Making Every Pixel Count: A Deep Dive into Efficient Non-Local Contrastive Attention

Have you ever zoomed in on a photo only to find a blurry, pixelated mess? The quest to transform that low-resolution (LR) image into a sharp, high-resolution (HR) masterpiece is the central challenge of Single Image Super-Resolution (SISR). This technology has wide-reaching applications—from enhancing medical scans for better diagnoses to clarifying surveillance footage for security purposes. For years, deep learning models have led the way in SISR, learning to map LR images to HR outputs. A major breakthrough came with the introduction of Non-Local Attention (NLA). The idea was deceptively simple: to reconstruct a patch of an image (for example, a brick in a wall), a model could look for visually similar bricks elsewhere in the image and borrow their detail. This allowed models to leverage an image’s internal correlations and textures globally, far beyond their local receptive fields. ...

2022-01 · 6 min · 1136 words
[SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning 🔗](https://arxiv.org/abs/2012.09852)

SpAtten: Making Transformers Spartan by Pruning Redundant Language

Introduction: The Unbearable Slowness of Attention

Transformer-based models like BERT and GPT have revolutionized Natural Language Processing (NLP), achieving state-of-the-art results on everything from sentiment analysis to text generation. They can write code, summarize articles, and even hold surprisingly coherent conversations. But this incredible power comes at a steep price: computational cost. The secret sauce of these models is the attention mechanism, a clever technique that allows them to weigh the importance of different words in a sentence. The problem? Attention has quadratic complexity, meaning its computational cost grows with the square of the input sentence length. Processing a 100-word sentence is one thing, but processing a 1000-word document is 100 times more expensive. ...

2020-12 · 7 min · 1423 words
[MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention 🔗](https://arxiv.org/abs/2407.02490)

From 30 Minutes to 3: How MInference Slashes LLM Wait Times for Million-Token Prompts

Large Language Models (LLMs) are rapidly expanding their horizons, now capable of processing context windows of a million tokens or more. This unlocks incredible applications — from understanding entire code repositories, to answering nuanced questions about lengthy legal documents, to reasoning across sprawling datasets. But with great context comes great computational cost. Consider feeding a 1M-token prompt to a state-of-the-art LLM. Even on a powerful Nvidia A100 GPU, you might have to wait 30 minutes before the model produces the first output token. This initial delay occurs during the pre-filling stage — the process of ingesting the prompt, computing attention over every token, and setting up the key-value (KV) cache for subsequent decoding. The main culprit? The Transformer’s self-attention, whose computation scales quadratically with input length. ...
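
A deliberately toy NumPy sketch (hypothetical code, with projections, batching, and causal masking omitted) shows why the two stages behave so differently:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)

def attend(q, K, V):
    # Scaled dot-product attention of queries q against keys K and values V.
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Pre-filling: every prompt token attends to every prompt token, so the cost
# grows with the square of the prompt length -- this is the slow first stage.
prompt = rng.standard_normal((4096, d))          # stand-in for projected prompt tokens
K_cache, V_cache = prompt.copy(), prompt.copy()  # KV cache built once during pre-fill
_ = attend(prompt, K_cache, V_cache)

# Decoding: each new token is a single query against the cached K/V,
# so the per-token cost is only linear in the context length.
new_token = rng.standard_normal((1, d))
print(attend(new_token, K_cache, V_cache).shape)  # (1, 64)
```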

2024-07 · 6 min · 1066 words
[Faster Causal Attention Over Large Sequences Through Sparse Flash Attention 🔗](https://arxiv.org/abs/2306.01160)

Beyond FlashAttention: Making Transformers Even Faster with Dynamic Sparsity

Transformers are everywhere—powering tools from ChatGPT to code completion assistants—but they have a well-known Achilles’ heel: the self-attention mechanism. As you feed a Transformer longer sequences of text, the computation required for attention grows quadratically. Doubling the sequence length means quadrupling the work. This computational bottleneck makes training on very long documents, high-resolution images, or extensive codebases both difficult and expensive. Researchers have long suspected that much of this work is wasted. In practice, a token only needs to closely attend to a small subset of other tokens. This insight fueled research into sparse attention—methods that skip unnecessary computations. While some approaches rely on fixed patterns, others attempt dynamic, data-dependent strategies. ...

2023-06 · 5 min · 990 words
[DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training 🔗](https://arxiv.org/abs/2310.03294)

Unlocking Massive Contexts: A Deep Dive into DISTFLASHATTN

Large Language Models (LLMs) are rapidly evolving, and one of the most exciting frontiers is the expansion of their context windows. Imagine an AI that can read an entire novel, a full codebase, or a lengthy financial report in one go, and then answer your questions with full awareness of that entire content. This is the promise of long-context LLMs—but training them poses a formidable technical challenge. The key culprit? The self-attention mechanism, the core of Transformer architectures, whose memory usage scales quadratically with sequence length. ...

2023-10 · 6 min · 1187 words
[FLASHMASK: EFFICIENT AND RICH MASK EXTENSION OF FLASHATTENTION 🔗](https://arxiv.org/abs/2410.01359)

FLASHMASK: Taming Long Sequences with Ultra-Efficient Attention Masks

The Transformer architecture powers modern AI—from ChatGPT to Gemini—thanks to its attention mechanism, which allows models to focus selectively on relevant parts of the input. But with great power comes a serious bottleneck: as sequence lengths grow to entire books or massive codebases, the computational and memory demands of attention scale quadratically. Double the input length, and you quadruple the work. This is the infamous quadratic bottleneck. Breakthroughs like FlashAttention have reduced these costs for standard use cases by avoiding expensive intermediate memory allocations. However, FlashAttention struggles when faced with the complex attention masks needed for modern training tasks—masks that dictate which tokens can “see” each other. These masks are crucial in scenarios like preference optimization, fine-tuning, or sequence packing. Current approaches often revert to dense, memory-hungry computations for such masks. ...
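
To picture what these masks look like, here is a hypothetical NumPy sketch (mask construction only, not the FLASHMASK representation itself) of a causal mask and a sequence-packing mask:

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: token i may look at tokens 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def packed_causal_mask(doc_lengths):
    # Sequence packing: several documents share one training sequence, and
    # tokens must not attend across document boundaries.
    n = sum(doc_lengths)
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal_mask(n) & same_doc

print(packed_causal_mask([2, 3]).astype(int))
# Dense boolean masks like these take O(n^2) memory on their own, which is the
# overhead FLASHMASK avoids by storing masks in a compact column-wise form.
```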

2024-10 · 6 min · 1116 words
[FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision 🔗](https://arxiv.org/abs/2407.08608)

Unpacking FlashAttention-3: How Asynchrony and FP8 Supercharge Transformers

The Transformer architecture is the powerhouse behind today’s AI revolution, but it has one stubborn bottleneck: the attention mechanism. As we push for larger models that can process entire books, massive codebases, or hours of video, the quadratic complexity of attention becomes a major computational obstacle. Simply put, the longer the input, the more the attention mechanism struggles—and costs skyrocket. This scaling issue has sparked intense innovation in making attention faster and more efficient. A few years back, FlashAttention appeared as a breakthrough: by cleverly managing memory I/O on GPUs, it delivered exact attention at high speed without resorting to approximations. Its successor, FlashAttention-2, improved parallelism and load balancing—but even then, on cutting-edge NVIDIA H100 GPUs, it achieved only ~35% of the hardware’s theoretical maximum throughput. ...

2024-07 · 6 min · 1187 words
[FLASHATTENTION-2: Faster Attention with Better Parallelism and Work Partitioning 🔗](https://arxiv.org/abs/2307.08691)

FlashAttention-2: Even Faster, Even More Efficient Attention for Transformers

If you’ve been following the world of large language models, you know that one of the biggest goals is expanding the context window. We want models that can read entire books, analyze lengthy codebases, or process high-resolution images. The main obstacle? The attention mechanism at the heart of the Transformer architecture. Its computational and memory costs grow quadratically with the sequence length, making long contexts prohibitively expensive. A breakthrough paper in 2022, FlashAttention, tackled this problem head-on. By cleverly reordering the attention computation to be more aware of the GPU’s memory hierarchy, it achieved linear memory usage and a 2–4× speedup over standard implementations—all without any approximation. It was a game-changer and has been widely adopted. ...

2023-07 · 8 min · 1584 words
[FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness 🔗](https://arxiv.org/abs/2205.14135)

FlashAttention: Is IO-Awareness the Key to Unlocking Long-Context Transformers?

Transformers have revolutionized machine learning, but they have a well-known Achilles’ heel: the self-attention mechanism. While incredibly powerful, its computational and memory costs grow quadratically with the sequence length. This \(O(N^2)\) complexity has been a major barrier, making it prohibitively expensive to train models on long documents, high-resolution images, or lengthy audio clips. For years, researchers have tried to tame this quadratic beast with approximate attention methods. These techniques trade a bit of model accuracy for better efficiency, often reducing complexity to linear or near-linear time. But here’s the catch: many of these theoretically faster methods don’t actually speed up training in practice. They reduce the number of calculations (FLOPs), but often overlook the real bottleneck on modern hardware like GPUs: memory access. ...
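
The core idea can be sketched in plain NumPy (an illustrative re-derivation rather than the paper's kernel; the real implementation fuses these steps into GPU on-chip SRAM tiles, which NumPy cannot express): walk over the keys and values block by block, carrying only running softmax statistics, so the full \(N \times N\) score matrix is never materialized.

```python
import numpy as np

def attention_tiled(Q, K, V, block=256):
    # Compute softmax(Q K^T / sqrt(d)) V one key/value block at a time,
    # keeping running max/sum statistics instead of the full N x N matrix.
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        scores = Q @ Kb.T / np.sqrt(d)                # shape (n, block), never (n, n)
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)             # rescale earlier partial results
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against a naive version that does materialize the full matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.allclose(attention_tiled(Q, K, V), naive))  # True
```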

2022-05 · 6 min · 1271 words