Unpacking the RNN Encoder–Decoder: The Paper That Taught Machines to Translate

Machine translation is one of those problems that seems deceptively simple at first glance. Can’t we just swap words from one language for another? Anyone who has tried this, or used an early translation tool, knows the comical and often nonsensical results. The sentence “The cat sat on the mat” isn’t just a collection of words; it’s a structure with grammatical rules and a specific meaning. True translation requires understanding the entire thought before expressing it in another language. ...

[Flipping the Dialogue: Training and Evaluating User Language Models 🔗](https://arxiv.org/abs/2510.06552)

Why AI Assistants Make Terrible Simulated Users — And How 'Flipping the Dialogue' Fixes It

You’ve probably chatted with an AI assistant like ChatGPT, Claude, or Llama. You type a question, and it fires back with a polished, well-structured answer — articulate, exhaustive, and unfailingly polite. These models are trained to be ideal conversational partners. But here’s the catch: real human users aren’t like that. Our requests in the wild are messy. We make typos, use slang, change our minds mid-conversation, and rarely lay out our entire request in perfect order. For example: ...

2025-10
[SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning 🔗](https://arxiv.org/abs/2111.13196)

Teaching Machines to Describe Videos: A Deep Dive into SWINBERT

Have you ever wondered how platforms like YouTube can automatically generate captions for videos? This task—known as video captioning—is a fascinating challenge at the intersection of computer vision and natural language processing. It requires a machine to not only see what’s happening in a video, but also to understand sequences of actions and describe them in clear, coherent, human-like language. For years, the standard approach has resembled a factory production line: ...

2021-11
[Efficient Content-Based Sparse Attention with Routing Transformers 🔗](https://arxiv.org/abs/2003.05997)

Taming the Quadratic Beast — How Routing Transformers Scale to Massive Sequences

The Transformer architecture, with its powerful self-attention mechanism, has revolutionized machine learning. From generating human-like text with GPT models to creating stunning images, its impact is undeniable. At its heart, self-attention allows a model to weigh the importance of every single piece of input when processing any other piece. This gives it a comprehensive, global understanding of the data. But this power comes at a steep price: the computational and memory costs of self-attention grow quadratically with sequence length — \(O(n^2)\). This means that doubling the sequence length quadruples the cost. For sequences of a few thousand tokens, this is manageable. But what about modeling an entire book, a high-resolution image, or a full-length symphony? The quadratic scaling quickly becomes a prohibitive bottleneck, making it incredibly difficult to apply Transformers to truly long sequences. ...
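To make that quadratic term concrete, here is a minimal NumPy sketch of plain dense self-attention (not the Routing Transformer's content-based routing): the (n, n) score matrix it builds is exactly the object whose size grows with the square of the sequence length.

```python
# Minimal sketch of vanilla dense single-head self-attention in NumPy,
# shown only to make the O(n^2) cost concrete -- not the Routing Transformer.
import numpy as np

def dense_self_attention(x, w_q, w_k, w_v):
    """x: (n, d) token embeddings; w_q, w_k, w_v: (d, d) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n) score matrix: quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all n keys
    return weights @ v                               # every token mixes information from every other

n, d = 2048, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = dense_self_attention(x, w_q, w_k, w_v)
print(out.shape, "score-matrix entries:", n * n)     # (2048, 64), 4,194,304 entries; 4x that at n=4096
```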

2020-03
[Efficient Non-Local Contrastive Attention for Image Super-Resolution 🔗](https://arxiv.org/abs/2201.03794)

Making Every Pixel Count: A Deep Dive into Efficient Non-Local Contrastive Attention

Have you ever zoomed in on a photo only to find a blurry, pixelated mess? The quest to transform that low-resolution (LR) image into a sharp, high-resolution (HR) masterpiece is the central challenge of Single Image Super-Resolution (SISR). This technology has wide-reaching applications—from enhancing medical scans for better diagnoses to clarifying surveillance footage for security purposes. For years, deep learning models have led the way in SISR, learning to map LR images to HR outputs. A major breakthrough came with the introduction of Non-Local Attention (NLA). The idea was deceptively simple: to reconstruct a patch of an image (for example, a brick in a wall), a model could look for visually similar bricks elsewhere in the image and borrow their detail. This allowed models to leverage an image’s internal correlations and textures globally, far beyond their local receptive fields. ...
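For reference, this is what vanilla non-local attention over a feature map looks like in a minimal NumPy sketch (not the paper's efficient contrastive variant): every spatial position is rebuilt as a similarity-weighted mix of all other positions, which is where both the borrowed texture detail and the (hw)² cost come from.

```python
# A minimal sketch of plain non-local attention over a feature map; the paper's
# contrastive/efficient version is more involved, this only illustrates the idea.
import numpy as np

def non_local_attention(feat):
    """feat: (h, w, c) feature map; returns a globally aggregated map."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c)                       # one row per spatial position
    sim = x @ x.T / np.sqrt(c)                       # (hw, hw) pairwise similarity
    sim = np.exp(sim - sim.max(axis=-1, keepdims=True))
    attn = sim / sim.sum(axis=-1, keepdims=True)     # softmax: "which patches look like me?"
    return (attn @ x).reshape(h, w, c)               # borrow detail from look-alike patches

rng = np.random.default_rng(0)
lr_features = rng.standard_normal((48, 48, 64))
out = non_local_attention(lr_features)
print(out.shape)  # (48, 48, 64); the (hw)^2 similarity matrix is the cost driver
```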

2022-01
[SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning 🔗](https://arxiv.org/abs/2012.09852)

SpAtten: Making Transformers Spartan by Pruning Redundant Language

Introduction: The Unbearable Slowness of Attention

Transformer-based models like BERT and GPT have revolutionized Natural Language Processing (NLP), achieving state-of-the-art results on everything from sentiment analysis to text generation. They can write code, summarize articles, and even hold surprisingly coherent conversations. But this incredible power comes at a steep price: computational cost. The secret sauce of these models is the attention mechanism, a clever technique that allows them to weigh the importance of different words in a sentence. The problem? Attention has quadratic complexity, meaning its computational cost grows with the square of the input length. Processing a 100-word sentence is one thing, but processing a 1000-word document is 100 times more expensive. ...
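For a feel of what pruning redundant tokens means in practice, here is a hedged toy sketch; the scoring rule (summing the attention each token receives) is an illustrative assumption for this example, not a description of SpAtten's hardware pipeline.

```python
# Toy sketch of token pruning: drop the tokens that attention rarely lands on
# so that later layers process fewer tokens. The importance score used here
# (column sum of the attention matrix) is an assumption for illustration only.
import numpy as np

def prune_tokens(attn_probs, tokens, keep_ratio=0.5):
    """attn_probs: (n, n) row-stochastic attention; tokens: list of n tokens."""
    importance = attn_probs.sum(axis=0)                  # how much attention each token receives
    keep = max(1, int(len(tokens) * keep_ratio))
    kept_idx = np.sort(np.argsort(importance)[-keep:])   # keep the most-attended tokens, in order
    return [tokens[i] for i in kept_idx]

rng = np.random.default_rng(0)
tokens = "the movie was , quite frankly , absolutely wonderful".split()
raw = rng.random((len(tokens), len(tokens)))
attn = raw / raw.sum(axis=-1, keepdims=True)             # fake row-stochastic attention weights
print(prune_tokens(attn, tokens, keep_ratio=0.5))        # roughly half the tokens survive
```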

2020-12
[MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention 🔗](https://arxiv.org/abs/2407.02490)

From 30 Minutes to 3: How MInference Slashes LLM Wait Times for Million-Token Prompts

Large Language Models (LLMs) are rapidly expanding their horizons, now capable of processing context windows of a million tokens or more. This unlocks incredible applications — from understanding entire code repositories, to answering nuanced questions about lengthy legal documents, to reasoning across sprawling datasets. But with great context comes great computational cost. Consider feeding a 1M-token prompt to a state-of-the-art LLM. Even on a powerful Nvidia A100 GPU, you might have to wait 30 minutes before the model produces the first output token. This initial delay occurs during the pre-filling stage — the process of ingesting the prompt, computing attention over every token, and setting up the key-value (KV) cache for subsequent decoding. The main culprit? The Transformer’s self-attention, whose computation scales quadratically with input length. ...
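A minimal single-head sketch of what pre-filling does, assuming vanilla dense causal attention rather than MInference's dynamic sparse patterns: attention is computed once over the full prompt, and the keys and values are kept as the cache that decoding reuses.

```python
# Sketch of the pre-filling stage for one attention head (dense causal
# attention, not MInference's sparse method): the (n, n) score matrix is why
# a million-token prompt is so slow, while the KV cache itself grows only linearly.
import numpy as np

def prefill_one_head(prompt_emb, w_q, w_k, w_v):
    """prompt_emb: (n, d). Returns the prefill output and the (K, V) cache."""
    q, k, v = prompt_emb @ w_q, prompt_emb @ w_k, prompt_emb @ w_v
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (n, n) attention scores
    scores += np.triu(np.full((n, n), -np.inf), 1)      # causal mask: no attending to future tokens
    p = np.exp(scores - scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return p @ v, (k, v)                                # output + KV cache reused during decoding

rng = np.random.default_rng(0)
n, d = 2048, 64
emb = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, kv_cache = prefill_one_head(emb, wq, wk, wv)
print(out.shape, kv_cache[0].shape)                     # (2048, 64) output, (2048, 64) cached keys
```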

2024-07
[Faster Causal Attention Over Large Sequences Through Sparse Flash Attention 🔗](https://arxiv.org/abs/2306.01160)

Beyond FlashAttention: Making Transformers Even Faster with Dynamic Sparsity

Transformers are everywhere—powering tools from ChatGPT to code completion assistants—but they have a well-known Achilles’ heel: the self-attention mechanism. As you feed a Transformer longer sequences of text, the computation required for attention grows quadratically. Doubling the sequence length means quadrupling the work. This computational bottleneck makes training on very long documents, high-resolution images, or extensive codebases both difficult and expensive. Researchers have long suspected that much of this work is wasted. In practice, a token only needs to closely attend to a small subset of other tokens. This insight fueled research into sparse attention—methods that skip unnecessary computations. While some approaches rely on fixed patterns, others attempt dynamic, data-dependent strategies. ...
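As a toy illustration of data-dependent sparsity (not the paper's FlashAttention-based kernels), the sketch below lets each query keep only its top-k keys. Note that it still scores every pair first, which is precisely the work real sparse-attention kernels are designed to avoid.

```python
# Toy data-dependent sparse attention: each query attends only to its top-k keys.
# This illustration still materializes the full score matrix; practical kernels
# get their speedup by never computing the dropped entries at all.
import numpy as np

def topk_sparse_attention(q, k, v, keep=8):
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n, n), for illustration only
    kth = np.sort(scores, axis=-1)[:, -keep][:, None]    # per-query k-th largest score
    scores = np.where(scores >= kth, scores, -np.inf)    # drop everything below the cutoff
    p = np.exp(scores - scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
n, d = 512, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_sparse_attention(q, k, v, keep=8)
print(out.shape)  # (512, 64): each output row mixes only 8 of the 512 values
```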

2023-06
[DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training 🔗](https://arxiv.org/abs/2310.03294)

Unlocking Massive Contexts: A Deep Dive into DISTFLASHATTN

Large Language Models (LLMs) are rapidly evolving, and one of the most exciting frontiers is the expansion of their context windows. Imagine an AI that can read an entire novel, a full codebase, or a lengthy financial report in one go, and then answer your questions with full awareness of that entire content. This is the promise of long-context LLMs—but training them poses a formidable technical challenge. The key culprit? The self-attention mechanism, the core of Transformer architectures, whose memory usage scales quadratically with sequence length. ...

2023-10
[FlashMask: Efficient and Rich Mask Extension of FlashAttention 🔗](https://arxiv.org/abs/2410.01359)

FlashMask: Taming Long Sequences with Ultra-Efficient Attention Masks

The Transformer architecture powers modern AI—from ChatGPT to Gemini—thanks to its attention mechanism, which allows models to focus selectively on relevant parts of the input. But with great power comes a serious bottleneck: as sequence lengths grow to entire books or massive codebases, the computational and memory demands of attention scale quadratically. Double the input length, and you quadruple the work. This is the infamous quadratic bottleneck. Breakthroughs like FlashAttention have reduced these costs for standard use cases by avoiding expensive intermediate memory allocations. However, FlashAttention struggles when faced with the complex attention masks needed for modern training tasks—masks that dictate which tokens can “see” each other. These masks are crucial in scenarios like preference optimization, fine-tuning, or sequence packing. Current approaches often revert to dense, memory-hungry computations for such masks. ...
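To see why such masks get expensive, here is a small sketch of the mask used for sequence packing, built the naive dense way; the O(n²) boolean matrix it materializes is exactly the overhead a mask-aware kernel is designed to sidestep. (This construction is a generic illustration, not FlashMask's own mask representation.)

```python
# Sequence packing mask: several short documents share one training row, and a
# token may attend only to earlier tokens within its own document. Building the
# mask densely like this costs O(n^2) memory, which is what mask-aware kernels avoid.
import numpy as np

def packed_causal_mask(doc_lengths):
    """True where query i may attend to key j: causal AND same document."""
    n = sum(doc_lengths)
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)   # document id per token
    causal = np.tril(np.ones((n, n), dtype=bool))                  # j <= i
    same_doc = doc_id[:, None] == doc_id[None, :]                  # no attending across documents
    return causal & same_doc

mask = packed_causal_mask([3, 2, 4])   # three documents packed into 9 tokens
print(mask.astype(int))                # block-diagonal lower-triangular pattern
```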

2024-10