[Efficient Multi-modal Large Language Models via Progressive Consistency Distillation 🔗](https://arxiv.org/abs/2510.00515)

The Tortoise and the Hare of AI: How Gradual Learning Makes Visual AI Faster

Multi-modal Large Language Models (MLLMs) are reshaping how we interact with AI. Models like LLaVA can look at an image and hold a conversation about it—combining the seeing ability of computer vision with the reasoning power of large language models (LLMs). They’re like high-performance sports cars: incredible on the track, but they burn through fuel—in this case, computational resources—at a staggering rate. The main fuel drain? The sheer number of visual tokens. While a text prompt might be dozens of tokens, a single image is often broken into hundreds of them, and high-resolution images or multi-frame videos can explode this count further. This data flood creates a computational bottleneck—slowing inference speed and hogging memory. ...
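
To get a feel for the scale, here is a rough back-of-the-envelope sketch (plain Python) of how visual token counts grow with resolution and frame count, assuming a ViT-style encoder that emits one token per image patch; the 336-pixel input with 14-pixel patches mirrors the CLIP ViT-L/14 encoder used by LLaVA-1.5, though exact counts vary by model.

```python
# Rough token-count arithmetic for a ViT-style visual encoder.
# Assumes one visual token per (patch_size x patch_size) image patch;
# real models may add special tokens or downsample, so treat these as
# order-of-magnitude numbers.

def visual_tokens(image_size: int, patch_size: int = 14, frames: int = 1) -> int:
    per_side = image_size // patch_size
    return per_side * per_side * frames

print(visual_tokens(336))             # 576 tokens for one 336x336 image
print(visual_tokens(672))             # 2304 tokens at higher resolution
print(visual_tokens(336, frames=16))  # 9216 tokens for a 16-frame clip
```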

2025-10
[Large Reasoning Models Learn Better Alignment from Flawed Thinking 🔗](https://arxiv.org/abs/2510.00938)

RECAP: Teaching AI to Think Critically by Showing It Flawed Reasoning

Large Language Models (LLMs) are becoming increasingly powerful, particularly a new class called Large Reasoning Models (LRMs). These models don’t just spit out an answer—they think by generating a step-by-step chain of thought (CoT) before coming to a conclusion. This reflective reasoning lets them tackle complex problems in math, coding, and beyond with remarkable results. But there’s a crack in the armor. Recent research has revealed that these sophisticated reasoning abilities are surprisingly brittle. A model can be nudged toward generating harmful content simply by giving its thought process a flawed starting point—this is called CoT prefilling. For example, starting the model’s chain of thought with a phrase like “I know how to do it. First…” can be enough to bypass safety training, leading to unsafe outputs. This raises a critical question: Do these models truly understand safety principles, or are they just skilled at following any reasoning path they’re given—whether good or bad? ...
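
To make the attack concrete, here is a minimal sketch of what CoT prefilling looks like at the prompt level. The chat layout and the `<think>` marker are generic assumptions of mine, not the paper's exact template; the only point is that decoding resumes inside a reasoning trace whose opening the attacker has already written.

```python
# Illustrative sketch of CoT prefilling: the model is asked to continue a
# chain of thought whose first words the attacker has already supplied.
# The chat layout and "<think>" marker are assumptions, not the paper's template.

harmful_request = "<some unsafe request>"

messages = [
    {"role": "user", "content": harmful_request},
    # The assistant turn is pre-filled, so generation resumes *inside* the
    # reasoning trace, right after a compliant-sounding opener.
    {"role": "assistant", "content": "<think> I know how to do it. First,"},
]

# A robustly aligned model should recover here (reflect, then refuse);
# a brittle one simply keeps completing the flawed reasoning path.
```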

2025-10
[Apriel-1.5-15B-Thinker: Mid-training is all you need 🔗](https://arxiv.org/abs/2510.01141)

Mid-Training is All You Need: How a 15B Model Reached the AI Frontier

In the world of artificial intelligence, there’s a constant arms race. Tech giants are building ever-larger models with hundreds of billions—or even trillions—of parameters, pushing the boundaries of what’s possible. But this relentless pursuit of scale comes at a cost—literally. These colossal models require immense computational power, making them expensive to train and deploy, and often locking them away behind proprietary APIs. This creates a fundamental tension: how can we achieve state-of-the-art AI reasoning without a state-of-the-art budget? Can a smaller, more accessible model compete with the giants? ...

2025-10
[The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain 🔗](https://arxiv.org/abs/2509.26507)

The Dragon Hatchling: A New AI Architecture Bridging Transformers and the Brain

Transformers gave us the large language models that changed everything. They are powerful, trainable at scale, and extremely effective in practice. Yet they remain — at least partly — a mystery: dense tensors, layer-normalized stacks, and attention matrices are excellent engineering abstractions, but they don’t look much like the massively parallel, locally interacting network of neurons and synapses that is the human brain. The paper “The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain” introduces a new family of architectures — BDH and its GPU-friendly variant BDH-GPU — that aim to bridge this gap. BDH is a graph-first, biologically inspired language-and-reasoning architecture whose GPU-friendly instantiation matches Transformer-level performance while sporting interpretable, local dynamics that look a lot like neurons and synapses. This post unpacks the core ideas, the intuition, and the key empirical findings so you can understand how BDH sits between tensors and biology. ...

2025-09

Unfolding Time: How a Simple Neural Network Learned the Rules of Language

How does the human mind handle time? It’s a question that feels both simple and impossibly complex. So much of what we do—from understanding a melody to catching a ball to having a conversation—depends on processing sequences of events as they unfold. Language, in particular, is a river of information flowing through time. The meaning of a sentence isn’t just in the words themselves, but in their order. “Dog bites man” is ordinary news; “Man bites dog” is a headline. ...
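
For readers who want to see what "processing a sequence as it unfolds" means mechanically, below is a minimal NumPy sketch of an Elman-style simple recurrent network step, in which the previous hidden state is fed back as context at the next time step; the sizes and weights are toy values of my own, not the paper's setup.

```python
import numpy as np

# Toy Elman-style simple recurrent network step: the previous hidden state is
# copied into "context units" and fed back in, so the network experiences the
# sentence one word at a time. Sizes and weights are illustrative only.

rng = np.random.default_rng(0)
vocab, hidden = 5, 8
W_in  = rng.normal(scale=0.1, size=(hidden, vocab))   # input word -> hidden
W_ctx = rng.normal(scale=0.1, size=(hidden, hidden))  # context    -> hidden
W_out = rng.normal(scale=0.1, size=(vocab, hidden))   # hidden -> next-word scores

def step(x_onehot, context):
    h = np.tanh(W_in @ x_onehot + W_ctx @ context)
    return W_out @ h, h  # (prediction logits, new context)

context = np.zeros(hidden)
for word_id in [0, 3, 1]:            # a toy three-word "sentence"
    x = np.eye(vocab)[word_id]
    logits, context = step(x, context)
```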

[Designing Network Design Strategies Through Gradient Path Analysis 🔗](https://arxiv.org/abs/2211.04800)

Rethinking Neural Network Design: A Deep Dive into Gradient Path Analysis

When designing deep neural networks, we usually focus on how data flows forward through the model. We stack layers, implement complex feature fusion mechanisms, and add attention modules to transform an input into the desired output. This traditional “data path” perspective has brought us powerful architectures like ResNet, DenseNet, and Transformers. But what if this forward-focused view is only half the story? What if the key to building more efficient and more powerful networks is to examine how information flows backward? ...
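
The backward view is easy to poke at yourself: after a single backward pass you can log how much gradient actually reaches each layer. The PyTorch snippet below is only an inspection sketch of that idea, not the paper's gradient-path analysis procedure.

```python
import torch
import torch.nn as nn

# Inspection sketch of the "backward view": run one forward/backward pass and
# log how much gradient reaches each layer's parameters.

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

for name, param in model.named_parameters():
    print(f"{name:12s} grad norm = {param.grad.norm():.4f}")
```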

2022-11
[Finetuned Language Models Are Zero-Shot Learners 🔗](https://arxiv.org/abs/2109.01652)

Just Tell the Model What to Do: How Instruction Tuning Unlocks Zero-Shot Learning

Large language models (LLMs) have shown astonishing capabilities: writing code, composing essays, and answering complex questions. Much of that success rests on few-shot learning—showing a model a few examples in the prompt and letting it generalize. But few-shot prompting has drawbacks: you need examples, and you often must engineer the prompt carefully. What if we could simply tell a model, in plain English, what we want it to do—and have it perform the task well without a single example? That’s the core question of “Finetuned Language Models Are Zero-Shot Learners” (Google Research). The paper shows that a surprisingly simple trick—instruction tuning—turns large pretrained models into strong zero-shot learners. The instruction-tuned model, FLAN (Finetuned Language Net), improves zero-shot performance across many tasks and even beats GPT-3 (175B) zero-shot on most evaluated datasets. ...
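
The contrast between the two prompting styles is easiest to see side by side. The prompts below are illustrative strings of my own, not examples from the paper.

```python
# Few-shot prompting: the task is conveyed through in-context examples.
few_shot_prompt = """\
Review: "The plot was predictable and the acting was wooden."
Sentiment: negative

Review: "A beautiful, moving film from start to finish."
Sentiment: positive

Review: "I left the theater halfway through."
Sentiment:"""

# Zero-shot instruction: the task is stated in plain English, no examples.
# Instruction tuning (as in FLAN) is what makes models follow this reliably.
zero_shot_prompt = (
    "Classify the sentiment of this movie review as positive or negative:\n"
    '"I left the theater halfway through."'
)
```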

2021-09

GPT-3: The Dawn of Few-Shot Learning

The Fine-Tuning Treadmill: A Problem of Scale. For years, the dominant paradigm in Natural Language Processing (NLP) has been a two-step dance. First, pre-train a massive, general-purpose language model on a vast ocean of text data. These models, such as BERT or RoBERTa, learn intricate patterns of language—grammar, facts, reasoning abilities, and even some biases. The second step is to take this powerful but general model and specialize it for a specific task through fine-tuning. ...

[Evaluating Large Language Models Trained on Code 🔗](https://arxiv.org/abs/2107.03374)

Inside Codex: The AI Pair Programmer That Powers GitHub Copilot

For decades, the idea of an AI that could write its own code has been a holy grail of computer science. We’ve seen glimpses of this future in science fiction, but in reality, teaching a machine the logic, creativity, and precision required for programming has been an immense challenge. When large language models (LLMs) like GPT-3 emerged, they revealed a surprising, albeit rudimentary, ability to generate simple code snippets from natural language prompts — even though they weren’t explicitly trained to code. ...

2021-07
[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity 🔗](https://arxiv.org/abs/2101.03961)

The Switch Transformer: A Trillion-Parameter AI Model That's Surprisingly Efficient

In the world of AI—and especially in Natural Language Processing (NLP)—the mantra for the past few years has been “bigger is better.” We’ve seen a parade of colossal language models like GPT-3, T5, and Megatron, each pushing the boundaries of size and performance. Scaling these models has unlocked incredible capabilities, from writing coherent essays to generating code. But it comes at a steep price: astronomical computational costs. Training these massive dense models, where every parameter is used for every single input, requires supercomputers and consumes enormous amounts of energy. ...
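
The trick that breaks this trade-off is sparse routing: each token is sent to exactly one expert feed-forward network, so parameter count grows with the number of experts while per-token compute stays roughly flat. Below is a toy NumPy sketch of Switch-style top-1 routing; it omits the load-balancing loss and expert capacity limits, and it is not the paper's Mesh-TensorFlow implementation.

```python
import numpy as np

# Toy sketch of Switch-style top-1 routing: each token is processed by exactly
# one expert FFN, scaled by its router probability. Load balancing and expert
# capacity limits are omitted.

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8

router_w = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]

tokens = rng.normal(size=(n_tokens, d_model))
logits = tokens @ router_w                                  # (n_tokens, n_experts)
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
choice = probs.argmax(-1)                                   # one expert per token

out = np.zeros_like(tokens)
for i, t in enumerate(tokens):
    w1, w2 = experts[choice[i]]
    out[i] = probs[i, choice[i]] * (np.maximum(t @ w1, 0.0) @ w2)
```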

2021-01