![Cover image](https://deep-paper.org/en/paper/2502.04313/images/cover.png)
# When Great Models Think Alike: Why AI Oversight Needs Diversity
## Introduction

We are witnessing an era in which machine learning models are improving at a breakneck pace. Scaling up training data and compute has produced Large Language Models (LLMs) that can pass bar exams, write code, and solve complex logic puzzles. But as these models approach, and potentially surpass, human capability, we face a bottleneck: evaluation. How do we supervise a system that is smarter or faster than we are? Collecting high-quality human annotations is slow and expensive. The industry's answer has been "AI Oversight": using one AI to grade or teach another. We see this in "LLM-as-a-judge" leaderboards and "weak-to-strong" generalization experiments. ...
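To make the pattern concrete, here is a minimal sketch of the LLM-as-a-judge setup, assuming a generic `query_llm` chat-completion call (a hypothetical placeholder, not any specific provider's API) and an illustrative judging prompt rather than the one any particular leaderboard uses.

```python
# Minimal sketch of the "LLM-as-a-judge" pattern: one model grades
# the outputs of two others on the same question.

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder: wire this to your LLM provider of choice."""
    raise NotImplementedError

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" or "B" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Verdict:"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of two candidate answers is better."""
    verdict = query_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b,
    ))
    return verdict.strip()[:1]  # "A" or "B"
```

The crux of the article is that this loop quietly assumes the judge's mistakes are independent of the judged models' mistakes; when the models think alike, that assumption breaks.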