[Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective 🔗](https://arxiv.org/abs/2305.15408)

Unlocking the Black Box: The Theory Behind Chain-of-Thought in LLMs

If you’ve used modern large language models (LLMs) on hard problems, you know the trick: append a prompt like “Let’s think step-by-step” and—often—the model produces intermediate reasoning and gets the answer right. That simple change, called Chain-of-Thought (CoT) prompting, has become a practical staple for eliciting better performance on math, logic, and reasoning tasks. But why does CoT help so much? Is it just coaxing the model to reveal what it already knows, or does it fundamentally change what the model can compute? ...

2023-05 · 12 min · 2359 words

Breaking the Speed Limit: How a New Algorithm is Revolutionizing Minimax Optimization

Introduction: The High-Stakes Game of Minimax Optimization. Imagine a game with two players. One player—the Minimizer—wants to make a certain value as small as possible. The other—the Maximizer—wants to make that same value as large as possible. They take turns, each trying to outsmart the other. This is the essence of a minimax problem, a concept that sits at the heart of many modern machine learning marvels. From training Generative Adversarial Networks (GANs) that create photorealistic images to building AI models robust to adversarial attacks, minimax optimization is everywhere. The core task is to solve problems of the form: ...

6 min · 1075 words
[MONARCH MIXER: A Simple Sub-Quadratic GEMM-Based Architecture 🔗](https://arxiv.org/abs/2310.12109)

Beyond Transformers: Scaling Deep Learning Sub-Quadratically with the Monarch Mixer

For the last decade, progress in deep learning has been a tale of scale: larger models, longer contexts, and wider feature spaces. That scaling has unlocked impressive capabilities, but it’s running into a practical limit—compute cost. At the heart of many state-of-the-art models are operations whose time and memory cost grow quadratically: attention over sequence length \(N\) and dense MLPs over feature dimension \(d\). Doubling the sequence length or the model width can quadruple compute. Quadratic growth quickly becomes prohibitive as we push to longer contexts and wider networks. ...

2023-10 · 11 min · 2260 words
[Random Cuts are Optimal for Explainable k-Medians 🔗](https://arxiv.org/abs/2304.09113)

Why Random Slices Are the Best Way to Explain Your Clusters

Machine learning models are often criticized for being black boxes. We feed them data, they produce an output, yet the reasoning behind their decisions is hidden. This lack of transparency can be unacceptable in high-stakes areas like medical diagnosis or loan applications, where understanding the why matters just as much as the what. Clustering—the task of grouping similar data points—is no exception. How can we trust a model’s clusters if we cannot understand the logic that creates them? ...

2023-04 · 6 min · 1221 words
[Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language 🔗](https://arxiv.org/abs/2306.02797)

How We Learn So Much From So Little: A Bayesian Model That Thinks in Natural Language

How do we learn new concepts so quickly? A child sees a “high-five” once or twice and can generalize to a low-five. A researcher hears about “few-shot prompting” and rapidly grasps the idea. From “1, 4, 16, 64,” we instantly infer the pattern is “powers of 4.” This ability to infer a general rule from a handful of specific examples—a process called induction—is a cornerstone of human intelligence. We do it effortlessly and across a seemingly infinite range of concepts. ...

2023-06 · 7 min · 1447 words
[Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent 🔗](https://arxiv.org/abs/2306.11589)

Solving Giant Gaussian Processes with... SGD? A Deep Dive into Benign Non-Convergence

Gaussian Processes (GPs) are the Swiss Army knife of machine learning when it comes to uncertainty quantification. They offer a powerful, principled way to not only make predictions but also to estimate how confident we should be in those predictions. This makes them invaluable for high-stakes applications like drug discovery, robotics, and automated scientific exploration, where knowing what the model doesn’t know is just as important as what it predicts. ...

2023-06 · 7 min · 1290 words
[Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models 🔗](https://arxiv.org/abs/2305.12827)

Beyond Fine-Tuning: A Deep Dive into Task Arithmetic and Weight Disentanglement

Introduction: The Art of Editing AI Models. Massive pre-trained models like CLIP, GPT, and T5 have become the backbone of modern AI. They possess an incredible wealth of general knowledge, but to be truly useful, they often need a bit of targeted editing. We might want to teach them a new skill, align them with human values, or make them forget undesirable behaviors. The standard approach is fine-tuning, which involves further training on a specialized dataset. However, fine-tuning can be computationally expensive, and it often comes with an unwelcome trade-off: a model fine-tuned for one task may lose some of its original zero-shot capabilities on others. ...

2023-05 · 6 min · 1167 words
[Entropic Neural Optimal Transport via Diffusion Processes 🔗](https://arxiv.org/abs/2211.01156)

From Schrödinger's Bridge to Neural Nets: A New End-to-End Solver for Entropic Optimal Transport

Introduction: The Challenge of Aligning Complex Data Distributions. Imagine you have two collections of images: a set of blurry photos and a set of sharp, high-resolution ones. How would you teach a model to transform any blurry photo into a realistic sharp version? Or consider translating summer landscapes into winter scenes. These are examples of a fundamental challenge in modern machine learning: finding a meaningful way to map one complex probability distribution to another. ...

2022-11 · 6 min · 1249 words
[LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS 🔗](https://arxiv.org/abs/2311.17245)

LightGaussian: Shrinking 3D Scenes by 15x While Boosting Rendering Speed

Figure 1: LightGaussian compresses large-scale 3D Gaussian Splatting scenes from 782 MB to 45 MB, while increasing rendering speed from 144 FPS to 237 FPS, with negligible loss in visual fidelity. Imagine creating a stunning, photorealistic 3D replica of a real-world scene that you can explore in real time. This is the promise of novel view synthesis—the art of generating unseen perspectives of a scene from a set of input images. Among the latest breakthroughs in this field, 3D Gaussian Splatting (3D-GS) has redefined the balance between speed and quality, enabling breathtaking, realistic scenes at interactive frame rates. ...

2023-11 · 6 min · 1134 words
[Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis 🔗](https://arxiv.org/abs/2401.02436)

Shrinking 3D Gaussian Splatting Scenes 31× and Rendering Them 4× Faster

3D Gaussian Splatting has been turning heads in the computer graphics community by enabling photorealistic scene reconstruction and real-time rendering from just a handful of images. The technique models a scene using millions of tiny, semi-transparent, colored blobs, known as 3D Gaussians, each contributing to the final picture. The catch? These reconstructed scenes are huge, often weighing in at multiple gigabytes. That makes them tricky to stream, impractical for mobile devices, and hard to integrate into VR/AR or games where every megabyte and millisecond matters. ...

2024-01 · 6 min · 1194 words
[GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces 🔗](https://arxiv.org/abs/2311.17977)

GaussianShader: Bringing Realistic Reflections to Real-Time Rendering

The world of 3D graphics has been on a wild ride. For years, we’ve chased the dream of creating digital scenes so realistic they’re indistinguishable from reality. A major leap forward came with Neural Radiance Fields (NeRF), which could generate stunningly photorealistic views from a handful of photos. But NeRF had a catch: it was incredibly slow. Then, in 2023, 3D Gaussian Splatting burst onto the scene and changed everything. It offered NeRF-like quality at blazing-fast, real-time speeds. Suddenly, high-fidelity 3D rendering was possible for interactive applications. However, this new technique had an Achilles’ heel: shiny, reflective surfaces. Polished metal, glossy plastic, glazed ceramics—these would often look flat, blurry, or just plain wrong. The simple color modeling of Gaussian Splatting couldn’t capture the complex, view-dependent dance of light on reflective surfaces. ...

2023-11 · 6 min · 1233 words
[GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models 🔗](https://arxiv.org/abs/2310.08529)

GaussianDreamer: From Text to Stunning 3D in 15 Minutes by Fusing 2D and 3D AI

Creating 3D assets has traditionally been the domain of skilled artists wielding complex software—a process that can take hours, if not days. The surge in generative AI, especially diffusion models, is reshaping that reality, bringing the promise of creating detailed 3D objects from simple text prompts into reach for anyone. But there’s been a catch: the two established approaches each come with trade-offs. On one side are 3D diffusion models. Trained directly on 3D data, these models excel at preserving structure and spatial coherence. They produce objects with excellent geometric consistency, but the scarcity and expense of high-quality 3D datasets limit their creative range and detailed realism. Complex prompts often yield oversimplified results. ...

2023-10 · 6 min · 1173 words
[GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2404.19702)

From 2D Pixels to 3D Splats: How GS-LRM Reconstructs Worlds from a Handful of Images

Fig. 1: Novel-view renderings predicted by GS-LRM from object captures (top left), text-conditioned generated object images (top right), scene captures (bottom left) and text-conditioned generated scene images (bottom right, from Sora with the prompt “Tour of an art gallery with many beautiful works of art in different styles”). GS-LRM handles both objects and complex scenes with remarkable fidelity. ...

2024-04 · 6 min · 1187 words
[Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields 🔗](https://arxiv.org/abs/2312.03203)

Beyond Photorealism: Feature 3DGS Brings AI Understanding to 3D Scenes

We’ve seen incredible progress in creating photorealistic 3D scenes from just a handful of 2D images. Technologies like Neural Radiance Fields (NeRFs) and, more recently, 3D Gaussian Splatting (3DGS) can generate stunning novel views of a scene, making you feel as if you’re flying a drone through a still photograph. But what if we wanted to do more than just look? What if we wanted to interact with, edit, and truly understand the 3D world we’ve just created? ...

2023-12 · 7 min · 1341 words
[GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2311.11700)

GS-SLAM: A New Era for Real-Time 3D Mapping with Gaussian Splatting

Imagine a robot navigating your home—not just avoiding obstacles, but building a photorealistic, high-fidelity 3D model of its surroundings as it moves. Or picture an augmented reality headset that seamlessly anchors virtual objects to the physical world in perfect alignment, with realistic lighting and shadows. These futuristic applications rely on a core technology called Simultaneous Localization and Mapping (SLAM). For decades, SLAM researchers have pursued the ideal system that is fast, accurate, and capable of producing dense, detailed maps. But achieving all three has proven elusive. Traditional methods often favor speed but sacrifice detail. More recent approaches based on Neural Radiance Fields (NeRFs) produce stunning maps—but their slow rendering speeds make real-time use impractical. ...

2023-11 · 5 min · 972 words
[Mip-Splatting: Alias-free 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2311.16493)

Mip-Splatting: The Secret to Crystal-Clear Zooms in 3D Gaussian Splatting

Introduction: In the fast-evolving world of computer graphics and vision, few techniques have made as big a splash as 3D Gaussian Splatting (3DGS). Since its introduction in 2023, it has impressed both researchers and developers by combining photorealistic novel view synthesis with real-time rendering speeds. For many, it felt like the practical, high-speed successor to Neural Radiance Fields (NeRFs) we had been waiting for. However, as people began pushing 3DGS to its limits, cracks started to show. While it produced stunning results for camera views similar to those in the training data, it struggled severely when the viewing scale changed. Zooming in could make objects look overly thin and noisy; zooming out often caused fine details to blur into glowing artifacts. ...

2023-11 · 5 min · 905 words
[MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images 🔗](https://arxiv.org/abs/2403.14627)

MVSplat: Building Stunning 3D Worlds from Just a Handful of Photos

Creating a digital 3D replica of a real-world scene from a small set of photographs is one of the long-standing goals in computer vision and graphics. This capability—often referred to as novel view synthesis or 3D reconstruction—powers technologies ranging from virtual reality experiences and cinematic visual effects to digital twins and architectural visualization. For years, methods like Neural Radiance Fields (NeRF) have delivered breathtaking photorealistic results. But there’s a catch: they typically require dozens, sometimes hundreds, of images of a scene and can be painfully slow to train and render. Recently, 3D Gaussian Splatting (3DGS) emerged, enabling real-time rendering speeds with comparable quality. Yet, these approaches still rely on dense input imagery. ...

2024-03 · 6 min · 1221 words
[3D Gaussian Splatting for Real-Time Radiance Field Rendering 🔗](https://arxiv.org/abs/2308.04079)

Real-Time Radiance Fields: A Deep Dive into 3D Gaussian Splatting

For the past few years, the world of computer graphics has been captivated by Neural Radiance Fields (NeRFs). These methods promised a groundbreaking way to capture and explore 3D scenes—producing stunningly realistic images from novel viewpoints using only a handful of photos. The results were incredible, but they came at a huge computational cost: training a high-quality NeRF could take days, and rendering a single high-resolution image could take several seconds. Real-time exploration was out of reach. ...

2023-08 · 6 min · 1257 words
[Visualizing and Understanding Recurrent Networks 🔗](https://arxiv.org/abs/1506.02078)

Opening the Black Box: How LSTMs Learn Long-Range Dependencies

Recurrent Neural Networks (RNNs), and their more powerful cousins, Long Short-Term Memory networks (LSTMs), are foundational tools for processing sequential data. They’ve enabled breakthroughs in everything from language translation and image captioning to speech and handwriting generation. Yet despite their success, LSTMs have long been treated as “black boxes.” We know they work—but how they work, what they learn, why they succeed, and where they fail have remained poorly understood. ...

2015-06 · 6 min · 1236 words
[Gated-Feedback Recurrent Neural Networks 🔗](https://arxiv.org/abs/1502.02367)

Rethinking Deep RNNs: The Power of Gated Feedback Connections

Recurrent Neural Networks (RNNs) are the workhorses of sequence modeling, powering everything from machine translation to text generation. A common strategy to boost their capability is to make them deep by stacking multiple recurrent layers. This stacked approach is intuitive: lower layers handle low-level, fast-changing features, while higher layers learn more abstract, slow-moving concepts. In conventional designs, information flows upward through the stack. But what if this one-way street is too restrictive? What if the high-level understanding from an upper layer could provide crucial context to a lower layer? Imagine a network writing a story: a high-level layer might track the overall plot point (the character is in danger), while a low-level layer generates the actual words. Knowing the plot point would be incredibly helpful for choosing the right vocabulary (“frantically,” “desperately”) at the character level. ...

2015-02 · 6 min · 1200 words