ICML 2025

[Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards 🔗](https://arxiv.org/abs/2501.07493)

Is Chatbot Arena Broken? How Adversaries Can Game LLM Leaderboards

In the rapidly evolving world of Artificial Intelligence, keeping score is hard. Traditional benchmarks (static lists of questions like the SATs or coding problems) are quickly becoming obsolete. Large Language Models (LLMs) are simply getting too smart for them, or worse, they have memorized the answers from their training data. To solve this, the AI community has turned to the “wisdom of the crowd.” Platforms like Chatbot Arena have become the gold standard for evaluating model performance. The premise is simple and elegant: pit two anonymous models against each other, have a human ask a question, and let the human vote on which answer is better. It feels fair, unbiased, and representative of real-world usage. ...

[Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination 🔗](https://arxiv.org/abs/2504.12714)

Beyond Self-Play: How Changing Environments Teaches AI to Cooperate with Anyone

Imagine you are a chef who has perfected a specific soup recipe with your sous-chef. You know exactly when they will chop the onions, and they know exactly when you will stir the broth. You move like a well-oiled machine. Now, imagine you step into a stranger’s kitchen. The layout is different, the stove is in a weird spot, and your new partner chops vegetables at a completely different pace. ...

[Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation 🔗](https://arxiv.org/abs/2503.01776)

Sparse is the New Dense: How CSR Beats Matryoshka for Adaptive Embeddings

In the era of Retrieval-Augmented Generation (RAG) and massive vector databases, the quality and efficiency of embeddings—those numerical vector representations of data—are paramount. We want embeddings that are rich in semantic meaning but also lightweight enough to search through millions of records in milliseconds. For a long time, the industry standard has been dense representations. Recently, Matryoshka Representation Learning (MRL) gained popularity (even being adopted by OpenAI) for its ability to create “nested” embeddings that can be truncated to different lengths. However, MRL comes with a heavy price: it requires expensive full-model retraining and often suffers from significant accuracy drops when the vectors are shortened. ...
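
To make "truncated to different lengths" concrete, here is a minimal sketch of what using a nested embedding at inference time can look like: keep the first `dim` coordinates and re-normalize before comparing. The helper names and the toy vectors are assumptions for illustration only; this is neither an MRL training procedure nor the CSR method from the paper.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates of an embedding and rescale to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

def cosine_unit(a, b):
    """Cosine similarity for two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional embeddings (hypothetical values, not from any real model).
query = [3.0, 4.0, 12.0, 0.0]
doc = [3.0, 4.0, 0.0, 12.0]

# Compare the same pair at full length and at a shortened length.
full_score = cosine_unit(truncate_and_normalize(query, 4),
                         truncate_and_normalize(doc, 4))
short_score = cosine_unit(truncate_and_normalize(query, 2),
                          truncate_and_normalize(doc, 2))
```

In this toy pair the two vectors agree on their first two coordinates and differ afterwards, so the shortened comparison scores them as more similar than the full one does. Information lost by truncation changing the ranking is the kind of accuracy drop the teaser refers to.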

[VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models 🔗](https://arxiv.org/abs/2502.02492)

Teaching AI to Move: How VideoJAM Solves the Motion Problem in Generative Video

The field of generative AI has moved at a breakneck pace. We have gone from blurry, postage-stamp-sized GIFs to high-definition, cinematic video generation in a matter of years. Models like Sora, Kling, and Gen-3 can render lighting, textures, and compositions that are nearly indistinguishable from reality. However, there is a catch. While these models have mastered appearance (what things look like), they often fail spectacularly at motion (how things move). ...

[Addressing Misspecification in Simulation-based Inference through Data-driven Calibration 🔗](https://arxiv.org/abs/2405.08719)

Bridging the Reality Gap: How RoPE Fixes Simulation-Based Inference with Optimal Transport

In modern science and engineering, we have moved away from modeling phenomena with a few hand-written equations. Instead, we rely on complex, stochastic computer simulations. From predicting climate change to modeling the cardiovascular system, these simulators allow us to describe intricate processes that defy simple analytical solutions. However, this reliance on simulation introduces a critical problem. We often use these simulators to solve the inverse problem: given a real-world observation (data), what were the physical parameters that generated it? This is the domain of Simulation-Based Inference (SBI). ...

[AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models 🔗](https://arxiv.org/abs/2501.16566)

Beyond "Happy" and "Sad": How AffectGPT is Revolutionizing Multimodal Emotion Understanding

If you have ever watched the movie Inside Out, you are familiar with the concept of “basic emotions.” In the film, a young girl’s mind is controlled by five distinct characters: Joy, Sadness, Anger, Fear, and Disgust. For decades, Artificial Intelligence researchers in the field of Multimodal Emotion Recognition (MER) have operated on a similar premise. They built systems designed to look at a video clip and categorize the human face or voice into one of these fixed, discrete buckets (often adding “Surprise” or “Neutral” to the mix). ...

[SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? 🔗](https://openreview.net/pdf?id=xZXhFg43EI)

Can AI Earn a Million Dollars? Deep Dive into the SWE-Lancer Benchmark

In the rapidly evolving world of Large Language Models (LLMs), we have seen AI systems graduate from solving simple textbook coding problems to winning medals in competitive programming. Yet, there remains a massive gap between solving a contained algorithm puzzle and navigating the messy, complex reality of professional software engineering. When OpenAI announced SWE-Bench Verified, models began to show promise, but critics argued that even these benchmarks relied too heavily on isolated tasks and unit tests that could be “gamed.” The question remained: If we deployed these models in the real freelance marketplace, would they actually get paid? ...

[From Weight-Based to State-Based Fine-Tuning: Further Memory Reduction on LoRA with Parallel Control 🔗](https://openreview.net/pdf?id=x4qvBVuzzu)

Beyond LoRA: How State-Based Control Unlocks Training 8B Models on Consumer GPUs

If you have ever tried to fine-tune a Large Language Model (LLM) on your local machine, you have likely run into the dreaded “CUDA Out of Memory” error. Modern models like LLaMA-3 are incredibly capable, but they are also massive. Even with the advent of Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA), the memory requirements often exceed what is available on standard consumer-grade hardware (like an NVIDIA RTX 3090 or 4090). ...

[Mixture of Lookup Experts 🔗](https://openreview.net/pdf?id=wUEp13rqXP)

Speed of a Dense Model, Power of an MoE: Understanding Mixture of Lookup Experts (MoLE)

In the world of Large Language Models (LLMs), we are constantly battling the “Scaling Laws.” The rule of thumb has generally been: if you want a smarter model, you need a bigger model. However, bigger models come with a steep price tag: they require massive computational power (FLOPs) and huge amounts of video memory (VRAM). To solve the computational problem, researchers turned to Mixture-of-Experts (MoE) architectures (like Mixtral 8x7B or DeepSeek-MoE). MoE models are clever; they have many parameters but only use a small fraction of them for each token generated. This keeps inference fast and cheap in terms of calculation. ...

[Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies 🔗](https://openreview.net/pdf?id=vQubr1uBUw)

Breaking the Vocabulary Barrier: How to Accelerate LLMs with Any Drafter Model

The inference speed of Large Language Models (LLMs) remains one of the primary bottlenecks in deploying generative AI. Whether you are running a chatbot, a code assistant, or a summarization tool, the cost and latency of generating text token-by-token can be prohibitive. To solve this, the community has largely adopted Speculative Decoding (SD). This technique uses a smaller, faster “drafter” model to guess upcoming tokens, which are then verified in parallel by the larger “target” model. When it works, it’s like magic: you get the exact same quality output but significantly faster. ...
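
The draft-then-verify loop described above can be sketched with toy "models" that are plain Python functions from a token list to the next token. This is a simplified greedy variant under stated assumptions: both models share one vocabulary, verification is written as a sequential loop (a real system would score all drafted positions in one batched target pass), and the names `speculative_decode`, `toy_target`, and `toy_drafter` are hypothetical. It is not the heterogeneous-vocabulary algorithm from the paper.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: ask the model for one token at a time."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target, drafter, prompt, n_new, k=4):
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) The cheap drafter guesses the next k tokens.
        draft = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2) The target checks each guess. We always append the target's own
        #    choice, so a wrong guess is replaced and the rest of the draft dropped.
        for guess in draft:
            expected = target(tokens)
            tokens.append(expected)
            if expected != guess or len(tokens) == goal:
                break
    return tokens

def toy_target(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10

def toy_drafter(tokens):
    """Toy drafter: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

spec_out = speculative_decode(toy_target, toy_drafter, [0], n_new=8)
base_out = greedy_decode(toy_target, [0], n_new=8)
```

Because every appended token is the target's own greedy choice, `spec_out` matches `base_out` even though the drafter is wrong at one step. That is the "exact same quality output" property in this simplified setting; the speedup in practice comes from verifying a whole draft in one target pass, which this sequential sketch does not model.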

[Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models 🔗](https://openreview.net/pdf?id=v77ZMzbsBA)

Cleaning Up the Mess: How Outlier Gradients Can Save Your Deep Learning Model

In the world of machine learning, we often obsess over the “model.” We tweak architectures, adjust learning rates, and experiment with novel optimizers. This is the model-centric approach. However, there is a growing realization that the biggest bottleneck isn’t usually the algorithm; it’s the data. This has given rise to data-centric AI, a paradigm where the focus shifts to improving the quality of the training data itself. ...

[Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsOning Benchmark 🔗](https://openreview.net/pdf?id=v26vwjxOEz)

When Models Can't See the Logic: Inside EMMA, the Benchmark Breaking Multimodal AI

Imagine you are an interior designer. You look at an empty room and a piece of furniture. In your mind, you rotate the furniture, place it against the back wall, and visualize how the light hits it. You haven’t moved a muscle, but you have performed a complex feat of multimodal reasoning. You combined visual perception with spatial logic. Now, consider the state of Artificial Intelligence. We know that Large Language Models (LLMs) like GPT-4o or Claude 3.5 are incredible at text-based reasoning. They can pass bar exams and solve complex riddles. We also know they can “see” images. But can they actually reason with those images in the way humans do? Can they perform that mental rotation, or simulate a physics experiment in their “mind’s eye”? ...

[Fully Dynamic Euclidean Bi-Chromatic Matching in Sublinear Update Time 🔗](https://arxiv.org/abs/2505.09010)

Matching Red and Blue Points at Light Speed: A Breakthrough in Dynamic Geometric Algorithms

Imagine two clouds of points floating in a 2D plane: one cloud is red, the other is blue. Your task is to pair every red point with a blue point such that the sum of the distances between the paired points is minimized. This is the Euclidean Bi-Chromatic Matching problem. While this sounds like a pure geometry puzzle, it is actually the computational backbone of the 1-Wasserstein distance (also known as the Earth Mover’s Distance). This metric is ubiquitous in modern computer science, used for everything from comparing probability distributions in Machine Learning (like in WGANs) to analyzing drift in time-series data and comparing images in Computer Vision. ...
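
The problem statement above is easy to write down for tiny inputs. The sketch below is a brute-force, static solver that tries every pairing of red to blue points and keeps the cheapest one. It only illustrates the objective: it takes factorial time, is usable for a handful of points at most, and has nothing to do with the sublinear-update dynamic algorithm the paper develops. The function name `min_cost_matching` and the sample points are assumptions made for illustration.

```python
import itertools
import math

def min_cost_matching(red, blue):
    """Brute-force minimum-cost perfect matching between two equal-sized point sets."""
    if len(red) != len(blue):
        raise ValueError("red and blue must contain the same number of points")
    best_cost, best_pairs = math.inf, []
    # Each permutation assigns red[i] to blue[perm[i]].
    for perm in itertools.permutations(range(len(blue))):
        cost = sum(math.dist(red[i], blue[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(perm))
    return best_cost, best_pairs

# Two red and two blue points; pairing each red point with the blue point
# directly above it is cheaper than the crossed pairing.
red = [(0.0, 0.0), (3.0, 0.0)]
blue = [(3.0, 1.0), (0.0, 1.0)]
cost, pairs = min_cost_matching(red, blue)
```

For equal-sized point sets with uniform weights, this minimum total cost divided by the number of points is the 1-Wasserstein distance between the two empirical distributions, which is why the matching problem matters beyond geometry.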

[Model Immunization from a Condition Number Perspective 🔗](https://arxiv.org/abs/2505.23760)

Vaccinating AI: How Linear Algebra Can Stop Model Misuse

The open-source AI revolution has democratized access to powerful tools, from Large Language Models (LLMs) to text-to-image generators. However, this accessibility comes with a significant risk: malicious fine-tuning. A bad actor can take a safe, publicly available model and fine-tune it on a small dataset of harmful content—be it creating non-consensual deepfakes, generating hate speech, or designing malware. This leads to a pressing safety question: Can we release a model that is “immune” to being taught bad behaviors, while still remaining useful for its intended purpose? ...

[Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics 🔗](https://arxiv.org/abs/2503.06366)

Can AI Generate Mathematical Conjectures? Bridging Machine Learning and Algebraic Combinatorics

The intersection of Artificial Intelligence and mathematics is currently one of the most exciting frontiers in science. When we think of “AI for Math,” we often imagine Large Language Models (LLMs) writing formal proofs or solving high school calculus word problems. However, the workflow of a professional mathematician involves much more than just writing down a proof. Before a theorem is proven, it must be conjectured. And before it is conjectured, a mathematician usually spends weeks or months generating “raw data”—calculating examples, drawing diagrams, and searching for patterns in discrete structures. This phase—the intuition-building and conjecturing phase—is where a new paper argues Machine Learning (ML) can shine. ...

[VideoRoPE: What Makes for Good Video Rotary Position Embedding? 🔗](https://openreview.net/pdf?id=tO7OVZkCo1)

Unlocking Long-Form Video Understanding: A Deep Dive into VideoRoPE

The capabilities of Large Language Models (LLMs) have exploded in recent years, largely due to their ability to process massive amounts of text. But as we move from text to video, we hit a new wall. Video isn’t just “text with pictures”—it is a complex, three-dimensional medium combining spatial details (what is happening in the frame) with temporal progression (when it is happening). Most current Video LLMs try to adapt text-based techniques directly to video, often with mixed results. The most critical component in this adaptation is Position Embedding—the way the model knows where a piece of information is located. ...

[Referring 3D Gaussian Splatting Segmentation 🔗](https://arxiv.org/abs/2508.08252)

Beyond Class Names - Finding Objects in 3D Scenes with Natural Language and ReferSplat

Imagine you are in a cluttered kitchen and you ask a robot to “pick up the red mug next to the laptop.” For a human, this is a trivial task. We process the semantic meaning (“red mug”), but crucially, we also process the spatial relationship (“next to the laptop”) to distinguish it from a red mug that might be on the drying rack. In the world of 3D computer vision, however, this simple request is a massive hurdle. While recent advances in 3D Gaussian Splatting (3DGS) have revolutionized how we render 3D scenes, enabling real-time, photorealistic views, the ability to understand and segment specific objects within those scenes based on complex language is lagging behind. ...

[DISTILLM-2: A Contrastive Approach Boosts the Distillation of LLMs 🔗](https://arxiv.org/abs/2503.07067)

Distilling Giants: How DISTILLM-2 Uses Contrastive Learning to Build Better Small LLMs

The race for larger, more capable Large Language Models (LLMs) has dominated headlines, but a parallel revolution is happening in the world of efficiency. Deploying massive models like GPT-4 or Llama-3-70B is computationally expensive and slow. This has driven the need for Knowledge Distillation (KD), the process of compressing the intelligence of a massive “teacher” model into a smaller, faster “student” model. While KD is effective, standard methods often treat all training data the same way, regardless of whether it came from the genius teacher or the learning student. This lack of nuance leads to suboptimal compression. ...

[Rényi Neural Processes 🔗](https://arxiv.org/abs/2405.15991)

Fixing the Flaw in Neural Processes: A Deep Dive into Rényi Divergence

In the world of probabilistic deep learning, Neural Processes (NPs) occupy a fascinating middle ground. They attempt to combine the flexibility of deep neural networks with the data-efficiency and uncertainty estimation of Gaussian Processes (GPs). If you have ever worked with meta-learning or few-shot learning, you know the dream: a model that can look at a handful of context points and immediately predict a distribution over functions for new target points. ...

[Learning Time-Varying Multi-Region Brain Communications via Scalable Markovian Gaussian Processes 🔗](https://arxiv.org/abs/2407.00397)

Unlocking the Brain’s Dynamic Chatroom: How Adaptive Delay Models Reveal Time-Varying Neural Communication

The human brain is often compared to a complex orchestra. Distinct regions—like the sections of strings, woodwinds, and percussion—must perform in perfect synchrony to produce a coherent symphony of thought and action. However, unlike a standard orchestra where the speed of sound is constant, the “communication speed” between brain regions is constantly shifting. Sometimes regions talk to each other instantly; other times, the signal lags, reflecting different cognitive processes like surprise, attention, or inhibition. ...