If you’ve ever used an AI coding assistant like GitHub Copilot, you’ve probably engaged in what researchers now call “vibe coding.” You don’t just ask for code once—you have a conversation. You might start with a basic request, then refine it:
“Okay, that works, but can you rewrite it using a for loop instead of recursion?”
or
“Add some comments, and make sure all lines are under 80 characters.”

You keep tweaking until the code not only functions but also feels right. It passes your personal “vibe check.” That feeling of “rightness” goes beyond logic—it includes readability, consistency with project style, minimal edits, and following nuanced, non-functional requests.

Here’s the core issue: while coding workflows have evolved into this interactive, vibe-centered practice, our evaluation methods haven’t kept up. Standard benchmarks still rely on metrics like pass@k, which only test whether the code passes unit tests. These metrics measure functionality alone—they don’t tell us whether the AI respected your stylistic or project-specific constraints. As a result, AI coding models often score highly in academic tests yet fail the vibe check in day-to-day use.

A new research paper from DeepMind and UIUC, “Vibe Checker: Aligning Code Evaluation with Human Preference,” argues that what’s missing in our evaluation toolkit is instruction following—the ability to honor non-functional requests users often make. The paper introduces a framework to measure this ability directly, offering the first concrete path toward aligning code generation models with human preferences.

A cartoon illustrating the concept of a “vibe check” in coding. A user asks for a Fibonacci function with a for loop. The first attempt uses recursion and fails the vibe check, even though it passes functional tests. The second attempt uses a for loop, passes both tests and the vibe check.

Figure 1 | Vibe check goes beyond functionality: users judge whether the code feels right, not just whether it runs correctly.


The Limits of “Correctness”

For years, the dominant metric for evaluating code generation was pass@k. It’s simple: sample k candidate solutions to a programming problem and count the problem as solved if at least one of them passes a predefined set of unit tests. If any candidate runs correctly, the model earns the point.
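
For reference, pass@k is usually reported via the standard unbiased estimator: sample n ≥ k generations, count the c that pass, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch, with illustrative sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn from n generations of which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 of them pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88
```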

This metric is useful for measuring functional competence, but it falls short in real coding interactions. When a developer says,
“Write a function to parse this log file, and please use the pathlib library,”
pass@k only checks whether the parsing works. If the model uses outdated os.path calls but gets the logic correct, pass@k gives full credit—even though the developer explicitly asked for pathlib.

The result? Benchmarks often fail to reflect human judgment. Rankings based on functional metrics don’t correlate well with preference-based leaderboards like Copilot Arena, where developers pick the snippet they actually like. A model that’s great at brute-force code solving might be a terrible collaborator, repeatedly ignoring stylistic details or project conventions.

To create genuinely useful coding assistants, evaluation must evolve beyond the binary notion of correctness.


VERICODE: A Dictionary of Verifiable Instructions

The researchers tackled this challenge by introducing VERICODE, a taxonomy of 30 verifiable code instructions. Each instruction captures a common non-functional requirement—rules that users often demand but that traditional benchmarks ignore.

VERICODE is grounded in four design principles:

  1. Verifiability: Every instruction is paired with an automated, deterministic verifier that returns a binary pass/fail. No subjective human judgment or unreliable LLM-as-a-judge required.
  2. Practice Grounding: The instructions reflect real-world conventions, sourced from over 800 industrial linter rules (not arbitrary language tricks).
  3. Comprehensive Coverage: The taxonomy spans five categories—Coding Style, Logic Patterns, Documentation, Error Handling, and Library/API Constraints.
  4. Meaningful Difficulty: Each instruction is curated to challenge advanced LLMs while remaining realistic for actual programming tasks.

A table showing examples of instructions from the VERICODE taxonomy across different categories like Coding Style, Logic Patterns, and Documentation. Each instruction has a prompt, a corresponding verifier (linter rule), and tunable parameters.

Figure 2 | Examples from VERICODE taxonomy. Each instruction is tied to a deterministic verifier and parameterized for scalable testing.

For example, one Coding Style instruction reads: “Write code ensuring all lines are no longer than {line_length} characters.” Its verifier checks compliance using standard linter rule E501, with a line_length parameter often set to 79 or 88. This flexible design means the 30 core instructions can be automatically expanded into hundreds of distinct, checkable constraints—creating a robust foundation for testing instruction-following capability.
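
To make that concrete, here is a minimal sketch of how a parameterized instruction and its deterministic verifier might be represented; the class, field names, and helper function are illustrative, not the paper’s implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableInstruction:
    """Hypothetical container: prompt template + deterministic verifier + parameter."""
    prompt_template: str
    verifier: Callable[[str, int], bool]
    parameter: int

    def prompt(self) -> str:
        return self.prompt_template.format(line_length=self.parameter)

    def check(self, code: str) -> bool:
        return self.verifier(code, self.parameter)

def max_line_length_ok(code: str, line_length: int) -> bool:
    """E501-style check: every line of `code` is at most `line_length` characters."""
    return all(len(line) <= line_length for line in code.splitlines())

# One core instruction expands into several distinct, checkable constraints.
variants = [
    VerifiableInstruction(
        "Write code ensuring all lines are no longer than {line_length} characters.",
        max_line_length_ok,
        n,
    )
    for n in (79, 88, 100)
]

code = "def fib(n):\n    a, b = 0, 1\n    # " + "x" * 90 + "\n    return a"
print(variants[0].prompt())               # the concrete instruction text for the 79-char limit
print([v.check(code) for v in variants])  # [False, False, True]: only the 100-char limit passes
```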


VIBE CHECKER: The New Gauntlet for Code LLMs

Built on VERICODE, VIBE CHECKER is a new testbed for evaluating models on both functional correctness and instruction following (IF). The researchers augmented two major benchmarks to incorporate verifiable constraints:

  • BigVibeBench: derived from BigCodeBench, focusing on real-world programming tasks.
  • LiveVibeBench: derived from LiveCodeBench, focusing on algorithmic and contest-style problems.

An LLM-based selector chooses relevant, non-conflicting instructions from VERICODE for each problem. The augmented benchmarks simulate two realistic interaction patterns, shown below.

An infographic comparing Single-Turn Generation and Multi-Turn Editing. In single-turn, all instructions are given at once. In multi-turn, instructions are provided sequentially to refine the code. Both are evaluated for Functionality and Instruction Following.

Figure 3 | The evaluation protocol models two settings: single-turn generation (all constraints at once) and multi-turn editing (constraints added over time).

There are two key modes:

  1. Single-Turn Generation: The model gets all instructions in one prompt and has a single chance to produce a compliant solution.
  2. Multi-Turn Editing: The model first writes a base implementation. Then, instructions are provided one at a time in subsequent turns, forcing iterative refinement while preserving previous intent.
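
The structural difference between the two protocols is easy to sketch. In the code below, generate, run_tests, and verify are placeholders for the model call, the unit-test runner, and the VERICODE verifiers; this is an illustration of the setup, not the authors’ harness:

```python
def single_turn(description, instructions, generate, run_tests, verify):
    """All constraints in one prompt; one shot at a compliant solution."""
    prompt = description + "\n\nAdditional requirements:\n" + "\n".join(instructions)
    code = generate([{"role": "user", "content": prompt}])
    return run_tests(code), [verify(inst, code) for inst in instructions]

def multi_turn(description, instructions, generate, run_tests, verify):
    """Base implementation first, then one constraint per editing turn."""
    history = [{"role": "user", "content": description}]
    code = generate(history)
    for inst in instructions:
        history += [{"role": "assistant", "content": code},
                    {"role": "user", "content": inst}]
        code = generate(history)  # the model must edit without breaking earlier turns
    return run_tests(code), [verify(inst, code) for inst in instructions]
```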

Each final answer is graded along two axes:

  • Functionality: Whether the code still passes unit tests. The authors define Functional Regression:

    \[ \mathrm{FR}_k = \frac{S_0 - S_k}{S_0} \]

    where \( S_0 \) is the original score and \( S_k \) is the score after adding k instructions.

  • Instruction Following (IF): Whether the model adheres to all given constraints, measured both per-instruction and per-task (a small sketch of all three metrics follows this list):

    \[ \mathrm{IF}_{\mathrm{instruction}} = \frac{1}{k} \sum_{j=1}^{k} I_j, \quad \mathrm{IF}_{\mathrm{task}} = \mathbb{1}\left[ \sum_{j=1}^{k} I_j = k \right] \]

    where \( I_j \in \{0, 1\} \) indicates whether the verifier for the j-th instruction passes.
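
A minimal sketch of how these quantities can be computed from the verifier outputs (variable names are mine, not the paper’s):

```python
def functional_regression(s0: float, sk: float) -> float:
    """FR_k = (S_0 - S_k) / S_0: relative drop in functional score
    after k instructions are added."""
    return (s0 - sk) / s0

def if_metrics(verifier_results: list[bool]) -> tuple[float, int]:
    """IF_instruction: fraction of constraints satisfied.
    IF_task: 1 only if every constraint is satisfied, else 0."""
    k = len(verifier_results)
    return sum(verifier_results) / k, int(sum(verifier_results) == k)

print(functional_regression(s0=0.60, sk=0.51))      # ≈ 0.15, i.e. a 15% relative drop
print(if_metrics([True, True, False, True, True]))  # (0.8, 0): a single miss fails the task
```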

The Results Are In: How Do Top LLMs Fare on the Vibe Check?

The researchers evaluated 31 leading LLMs across ten model families—including Gemini, Claude, GPT, DeepSeek, Qwen, and others. The findings illuminate deep gaps between what models currently optimize for and what developers actually care about.

1. Instruction Following Often Breaks the Code

You might expect that stylistic tweaks—like enforcing shorter lines or adding docstrings—wouldn’t impact functionality. But the data tell a different story: non-functional instructions consistently cause functional regression.

A table showing the functional regression rates for top-performing LLMs on BigVibeBench and LiveVibeBench. Many cells are highlighted in red, indicating a regression of over 5% or 10%.

Figure 4 | Adding non-functional constraints reduces functional correctness across all models. Deep red indicates >10% regression.

On BigVibeBench, nearly every model lost over 5% accuracy when five instructions were added in multi-turn mode. On LiveVibeBench, regressions exceeding 10% were common.
Interestingly, single-turn generation preserves functionality better: seeing all constraints upfront lets the model balance edits holistically, whereas multi-turn editing introduces compounding side effects that break previously working code.

Line graphs showing that as the number of instructions increases, functional regression worsens (a), while the task-level instruction following score plummets (b).

Figure 5 | As constraints increase, functionality declines while task-level instruction adherence drops sharply.


2. Models Struggle to Follow Multiple Instructions

LLMs are reasonably good at obeying one rule—but when asked to follow several simultaneously, their performance plummets.

A table showing the task-level instruction following scores for top models. Scores drop significantly as the number of instructions increases from 1 to 5, with many falling below 50% and even 30%.

Figure 6 | Task-level IF scores drop exponentially with more instructions. Even top models fail most multi-constraint tasks.

With just three instructions, most LLMs fail more than half the time. At five instructions, even Claude 4 Opus, among the strongest performers, succeeds only 46.75% of the time on BigVibeBench. This decay follows basic probability: if a model has 90% success per instruction, five independent constraints yield only \(0.9^5 \approx 59\%\) overall success.
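
The compounding is easy to reproduce under a simplifying independence assumption (real constraints interact, but the trend is the same):

```python
p = 0.90  # assumed per-instruction success rate
for k in range(1, 6):
    print(f"k={k}: expected task-level IF ≈ {p**k:.0%}")
# prints 90%, 81%, 73%, 66%, 59% as k grows from 1 to 5
```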

Yet, a silver lining: multi-turn editing improves instruction adherence, since sequential presentation helps models focus on one change at a time. The trade-off is clear—multi-turn boosts stylistic accuracy but costs correctness.


3. Models Suffer from “Lost in the Middle” Bias

Just as humans skim long documents, LLMs pay uneven attention across prompts. The analysis uncovered a classic U-shaped pattern, where models most reliably follow the first and last instructions, neglecting those in between.

Two line charts showing that instruction-level success is highest for instructions at the beginning and end of a prompt, and dips in the middle.

Figure 7 | Instruction-level adherence by position. Single-turn favors early constraints (primacy bias); multi-turn favors the latest (recency bias).

Single-turn generation favors the initial constraint—a primacy bias—while multi-turn editing peaks on the last instruction added—a recency bias. Practically speaking, this means prompt order matters: if an instruction is critical, place it first or last for best results.


4. The Punchline: A “Vibe Score” Predicts Human Preference

Performance metrics are valuable only if they mirror human judgment. To test that, the authors compared their results with LMArena, a massive dataset of over 800,000 human preference votes ranking model outputs.

They computed correlations between human preference and a composite metric combining functionality and instruction following:

\[ \text{Composite Score} = \alpha \ \mathrm{IF} + (1 - \alpha) \ \mathrm{Func} \]

Four plots showing the correlation between a composite score and human preference. In all cases, the peak correlation (marked with a star) occurs when the score is a mixture of both Instruction Following (IF) and Functionality (Func).

Figure 8 | Human preferences align best with a mix of instruction following and functional correctness. Stars mark optimal weightings.

Across both Pearson and Spearman correlations, the best agreement was always achieved with a mixture of the two signals—not pure functionality or pure instruction following alone. For real-world programming (BigVibeBench), human ratings weighted instruction following more heavily (α ≈ 0.7). For algorithmic tasks (LiveVibeBench), functionality dominated (α ≈ 0.4).
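
A small sketch of that sweep over α, using made-up per-model scores in place of the benchmark results and LMArena ratings:

```python
import numpy as np

# Hypothetical per-model scores; the real analysis uses the VIBE CHECKER
# benchmark scores and LMArena preference ratings for 31 models.
if_scores   = np.array([0.62, 0.55, 0.71, 0.48, 0.66])
func_scores = np.array([0.58, 0.63, 0.60, 0.52, 0.70])
preference  = np.array([1180, 1150, 1210, 1100, 1205])  # arena-style ratings

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

alphas = np.linspace(0, 1, 101)
correlations = [pearson(a * if_scores + (1 - a) * func_scores, preference)
                for a in alphas]
best = int(np.argmax(correlations))
print(f"best Pearson r = {correlations[best]:.3f} at alpha = {alphas[best]:.2f}")
```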

This pattern mirrors real usage: in day-to-day development, users value models that respect coding conventions and maintain clean style; in competitive scenarios, correctness outranks style.


Conclusion: Moving Beyond pass@k

The Vibe Checker study makes a compelling case for rethinking how we measure success in AI-assisted coding. Current benchmarks celebrate technical validity—whether a snippet works. But in real workflows, utility depends equally on clarity, adaptability, and compliance with user intent.

By introducing VERICODE, a deterministic taxonomy of human-standard instructions, and VIBE CHECKER, a unified testbed that combines functionality with instruction following, the researchers provide a path toward evaluations that capture what coders actually value.

The takeaway is simple but powerful:
LLMs should be judged not only on whether their code runs, but whether it feels right—whether it passes the vibe check.