If you’ve ever used an AI coding assistant like GitHub Copilot, you’ve probably engaged in what researchers now call “vibe coding.” You don’t just ask for code once—you have a conversation. You might start with a basic request, then refine it:
“Okay, that works, but can you rewrite it using a `for` loop instead of recursion?” or “Add some comments, and make sure all lines are under 80 characters.”
You keep tweaking until the code not only functions but also feels right. It passes your personal “vibe check.” That feeling of “rightness” goes beyond logic—it includes readability, consistency with project style, minimal edits, and following nuanced, non-functional requests.
Here’s the core issue: while coding workflows have evolved into this interactive, vibe-centered practice, our evaluation methods haven’t kept up. Standard benchmarks still rely on metrics like `pass@k`, which only test whether the code passes unit tests. These metrics measure functionality alone—they don’t tell us whether the AI respected your stylistic or project-specific constraints. As a result, AI coding models often score highly in academic tests yet fail the vibe check in day-to-day use.
A new research paper from DeepMind and UIUC, “Vibe Checker: Aligning Code Evaluation with Human Preference,” argues that what’s missing in our evaluation toolkit is instruction following—the ability to honor non-functional requests users often make. The paper introduces a framework to measure this ability directly, offering the first concrete path toward aligning code generation models with human preferences.
Figure 1 | Vibe check goes beyond functionality: users judge whether the code feels right, not just whether it runs correctly.
The Limits of “Correctness”
For years, the dominant metric for evaluating code generation was `pass@k`. It’s simple: have a model generate k candidate solutions to a programming problem and check whether at least one of them passes a predefined set of unit tests. If any candidate runs correctly, the problem counts as solved.
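For reference, `pass@k` is usually computed with the unbiased estimator popularized by the Codex evaluation: sample \( n \ge k \) candidates per problem, count the \( c \) that pass the unit tests, and average over problems:
\[ \mathrm{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]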
This metric is useful for measuring functional competence, but it falls short in real coding interactions. When a developer says, “Write a function to parse this log file, and please use the `pathlib` library,” `pass@k` only checks whether the parsing works. If the model uses outdated `os.path` calls but gets the logic correct, `pass@k` gives full credit—even though the developer explicitly asked for `pathlib`.
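To make the gap concrete, here is a hypothetical pair of snippets (the function names and the ERROR-filtering logic are illustrative, not drawn from the paper). Both parse a log file correctly and would earn identical `pass@k` credit, yet only the second honors the explicit request to use `pathlib`:

```python
# Functionally correct, but ignores the request to use pathlib.
import os

def read_errors_ospath(log_path):
    """Return all ERROR lines from a log file using os.path."""
    if not os.path.isfile(log_path):
        return []
    with open(log_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if "ERROR" in line]


# Same behavior, but follows the instruction: uses pathlib throughout.
from pathlib import Path

def read_errors_pathlib(log_path):
    """Return all ERROR lines from a log file using pathlib."""
    path = Path(log_path)
    if not path.is_file():
        return []
    return [line for line in path.read_text(encoding="utf-8").splitlines()
            if "ERROR" in line]
```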
The result? Benchmarks often fail to reflect human judgment. Rankings based on functional metrics don’t correlate well with preference-based leaderboards like Copilot Arena, where developers pick the snippet they actually like. A model that’s great at brute-force code solving might be a terrible collaborator, repeatedly ignoring stylistic details or project conventions.
To create genuinely useful coding assistants, evaluation must evolve beyond the binary notion of correctness.
VERICODE: A Dictionary of Verifiable Instructions
The researchers tackled this challenge by introducing VERICODE, a taxonomy of 30 verifiable code instructions. Each instruction captures a common non-functional requirement—rules that users often demand but that traditional benchmarks ignore.
VERICODE is grounded in four design principles:
- Verifiability: Every instruction is paired with an automated, deterministic verifier that returns a binary pass/fail. No subjective human judgment or unreliable LLM-as-a-judge required.
- Practice Grounding: The instructions reflect real-world conventions, sourced from over 800 industrial linter rules (not arbitrary language tricks).
- Comprehensive Coverage: The taxonomy spans five categories—Coding Style, Logic Patterns, Documentation, Error Handling, and Library/API Constraints.
- Meaningful Difficulty: Each instruction is curated to challenge advanced LLMs while remaining realistic for actual programming tasks.
Figure 2 | Examples from VERICODE taxonomy. Each instruction is tied to a deterministic verifier and parameterized for scalable testing.
For example, one Coding Style instruction reads: “Write code ensuring all lines are no longer than {line_length} characters.” Its verifier checks compliance using the standard linter rule E501, with the `line_length` parameter often set to 79 or 88. This flexible design means the 30 core instructions can be automatically expanded into hundreds of distinct, checkable constraints—creating a robust foundation for testing instruction-following capability.
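The paper does not include verifier code, but a deterministic line-length check is simple enough to sketch. The function below is a minimal stand-in for what an E501-style rule enforces, with the `line_length` parameter exposed so the same instruction can be instantiated at 79, 88, or any other limit:

```python
def verify_max_line_length(code: str, line_length: int = 79) -> bool:
    """Binary pass/fail: every line of the candidate code must fit
    within line_length characters (a minimal E501-style check)."""
    return all(len(line) <= line_length for line in code.splitlines())


# The same instruction, instantiated with two different limits.
snippet = "def add(a, b):\n    return a + b\n"
print(verify_max_line_length(snippet, line_length=79))  # True
print(verify_max_line_length(snippet, line_length=10))  # False -> violates the stricter limit
```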
VIBE CHECKER: The New Gauntlet for Code LLMs
Built on VERICODE, VIBE CHECKER is a new testbed for evaluating models on both functional correctness and instruction following (IF). The researchers augmented two major benchmarks to incorporate verifiable constraints:
- BigVibeBench: derived from BigCodeBench, focusing on real-world programming tasks.
- LiveVibeBench: derived from LiveCodeBench, focusing on algorithmic and contest-style problems.
An LLM-based selector chooses relevant, non-conflicting instructions from VERICODE for each problem. The augmented benchmarks simulate two realistic interaction patterns, shown below.
Figure 3 | The evaluation protocol models two settings: single-turn generation (all constraints at once) and multi-turn editing (constraints added over time).
There are two key modes; a rough prompt-construction sketch follows the list:
- Single-Turn Generation: The model gets all instructions in one prompt and has a single chance to produce a compliant solution.
- Multi-Turn Editing: The model first writes a base implementation. Then, instructions are provided one at a time in subsequent turns, forcing iterative refinement while preserving previous intent.
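The sketch below shows roughly how the two protocols differ in prompt construction; the prompt wording and message format are illustrative assumptions, not taken from the paper:

```python
def build_single_turn_prompt(task: str, instructions: list[str]) -> str:
    """Single-turn: the task and every constraint arrive in one prompt."""
    bullets = "\n".join(f"- {inst}" for inst in instructions)
    return f"{task}\n\nFollow all of these constraints:\n{bullets}"


def build_multi_turn_user_turns(task: str, instructions: list[str]) -> list[dict]:
    """Multi-turn: the model first solves the task, then each constraint
    arrives as a new user turn that asks for an edit of the previous answer."""
    turns = [{"role": "user", "content": task}]
    for inst in instructions:
        # In the real protocol the model's reply is inserted between these
        # user turns; only the user side of the exchange is shown here.
        turns.append({
            "role": "user",
            "content": f"Revise your previous code so that it also satisfies: {inst}",
        })
    return turns
```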
Each final answer is graded along two axes:
Functionality: Whether the code still passes unit tests. The authors define Functional Regression:
\[ \mathrm{FR}_k = \frac{S_0 - S_k}{S_0} \]
where \( S_0 \) is the original score and \( S_k \) is the score after adding \( k \) instructions.
Instruction Following (IF): Whether the model adheres to all given constraints, measured both per-instruction and per-task:
\[ \mathrm{IF}_{\mathrm{instruction}} = \frac{1}{k} \sum_{j=1}^{k} I_j, \quad \mathrm{IF}_{\mathrm{task}} = \mathbb{1}\left[ \sum_{j=1}^{k} I_j = k \right] \]
where \( I_j \in \{0, 1\} \) records whether the verifier for the j-th instruction passes.
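Assuming the verifier outcomes are available as 0/1 flags, both metrics reduce to a few lines of arithmetic. This is a minimal sketch with made-up scores, not the paper’s evaluation harness:

```python
def functional_regression(s0: float, sk: float) -> float:
    """FR_k: relative drop in functional score after adding k instructions."""
    return (s0 - sk) / s0


def if_scores(verifier_flags: list[int]) -> tuple[float, int]:
    """Instruction-level IF (mean of the per-instruction pass flags) and
    task-level IF (1 only if every instruction is satisfied)."""
    k = len(verifier_flags)
    return sum(verifier_flags) / k, int(sum(verifier_flags) == k)


# Example with placeholder numbers: 4 of 5 constraints satisfied gives
# partial per-instruction credit, but the task-level check still fails.
print(functional_regression(s0=0.62, sk=0.55))  # ~0.113, i.e. an 11.3% regression
print(if_scores([1, 1, 0, 1, 1]))               # (0.8, 0)
```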
The Results Are In: How Do Top LLMs Fare on the Vibe Check?
The researchers evaluated 31 leading LLMs across ten model families—including Gemini, Claude, GPT, DeepSeek, Qwen, and others. The findings illuminate deep gaps between what models currently optimize for and what developers actually care about.
1. Instruction Following Often Breaks the Code
You might expect that stylistic tweaks—like enforcing shorter lines or adding docstrings—wouldn’t impact functionality. But the data tell a different story: non-functional instructions consistently cause functional regression.
Figure 4 | Adding non-functional constraints reduces functional correctness across all models. Deep red indicates >10% regression.
On BigVibeBench, nearly every model lost over 5% accuracy when five instructions were added in multi-turn mode. On LiveVibeBench, regressions exceeding 10% were common.
Interestingly, single-turn generation preserves functionality better: seeing all constraints upfront allows the model to balance edits holistically, while multi-turn editing introduces compounding side effects and breaks.
Figure 5 | As constraints increase, functionality declines while task-level instruction adherence drops sharply.
2. Models Struggle to Follow Multiple Instructions
LLMs are reasonably good at obeying one rule—but when asked to follow several simultaneously, their performance plummets.
Figure 6 | Task-level IF scores drop exponentially with more instructions. Even top models fail most multi-constraint tasks.
With just three instructions, most LLMs fail more than half the time. At five instructions, even Claude 4 Opus—among the strongest performers—succeeds only 46.75% of the time on BigVibeBench. This decay follows basic probability: if a model has a 90% success rate per instruction, five independent constraints yield only \(0.9^5 \approx 59\%\) overall success.
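The compounding effect is easy to reproduce. Assuming independence and a constant per-instruction success rate, the projected task-level success is simply that rate raised to the number of constraints:

```python
# Projected task-level success under independence: p_task = p_instruction ** k.
per_instruction_success = 0.9
for k in range(1, 6):
    print(k, round(per_instruction_success ** k, 3))
# 1 0.9
# 2 0.81
# 3 0.729
# 4 0.656
# 5 0.59
```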
Yet, a silver lining: multi-turn editing improves instruction adherence, since sequential presentation helps models focus on one change at a time. The trade-off is clear—multi-turn boosts stylistic accuracy but costs correctness.
3. Models Suffer from “Lost in the Middle” Bias
Just as humans skim long documents, LLMs pay uneven attention across prompts. The analysis uncovered a classic U-shaped pattern, where models most reliably follow the first and last instructions, neglecting those in between.
Figure 7 | Instruction-level adherence by position. Single-turn favors early constraints (primacy bias); multi-turn favors the latest (recency bias).
Single-turn generation favors the initial constraint—a primacy bias—while multi-turn editing peaks on the last instruction added—a recency bias. Practically speaking, this means prompt order matters: if an instruction is critical, place it first or last for best results.
4. The Punchline: A “Vibe Score” Predicts Human Preference
Performance metrics are valuable only if they mirror human judgment. To test that, the authors compared their results with LMArena, a dataset of over 800,000 human preference votes on coding prompts, where users pick which model’s output they like better.
They computed correlations between human preference and a composite metric combining functionality and instruction following:
\[ \text{Composite Score} = \alpha \, \mathrm{IF} + (1 - \alpha) \, \mathrm{Func} \]
Figure 8 | Human preferences align best with a mix of instruction following and functional correctness. Stars mark optimal weightings.
Across both Pearson and Spearman correlations, the best agreement was always achieved with a mixture of the two signals—not pure functionality or pure instruction following alone. For real-world programming (BigVibeBench), human ratings weighted instruction following more heavily (α ≈ 0.7). For algorithmic tasks (LiveVibeBench), the balance tipped toward functionality (α ≈ 0.4, i.e. a 0.6 weight on functional correctness).
This pattern mirrors real usage: in day-to-day development, users value models that respect coding conventions and maintain clean style; in competitive scenarios, correctness outranks style.
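A rough sketch of how such a composite can be swept over α and compared against a human ranking follows. The per-model scores and the ranking are placeholders, and scipy’s `spearmanr` stands in for whatever correlation tooling the authors used:

```python
# Requires scipy (pip install scipy). All scores below are made-up placeholders.
from scipy.stats import spearmanr


def composite(if_score: float, func_score: float, alpha: float) -> float:
    """Composite score: alpha * IF + (1 - alpha) * Func."""
    return alpha * if_score + (1 - alpha) * func_score


# Hypothetical per-model scores: (instruction following, functionality).
model_scores = {"model_a": (0.72, 0.81), "model_b": (0.65, 0.88), "model_c": (0.58, 0.70)}
human_order = ["model_a", "model_b", "model_c"]  # best to worst (placeholder ranking)

for alpha in (0.0, 0.4, 0.7, 1.0):
    composites = [composite(*model_scores[m], alpha) for m in human_order]
    # Higher human rank -> larger numeric value, so a good composite correlates positively.
    rho, _ = spearmanr(composites, list(range(len(human_order), 0, -1)))
    print(f"alpha={alpha:.1f}  spearman_rho={rho:.2f}")
```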
Conclusion: Moving Beyond pass@k
The Vibe Checker study makes a compelling case for rethinking how we measure success in AI-assisted coding. Current benchmarks celebrate technical validity—whether a snippet works. But in real workflows, utility depends equally on clarity, adaptability, and compliance with user intent.
By introducing VERICODE, a deterministic taxonomy of human-standard instructions, and VIBE CHECKER, a unified testbed that combines functionality with instruction following, the researchers provide a path toward evaluations that capture what coders actually value.
The takeaway is simple but powerful:
LLMs should be judged not only on whether their code runs, but whether it feels right—whether it passes the vibe check.