How do we know if a Large Language Model (LLM) is doing a good job? This seemingly simple question is one of the hardest in modern AI. While humans can grade LLM responses, that process is slow, expensive, and impossible to scale. A promising alternative is to let another LLM serve as a judge, evaluating its peers. This “LLM-as-a-Judge” approach is quickly becoming essential for training, aligning, and evaluating language models.

But there’s a serious flaw in this system. Most LLM judges rely only on their own internal, text-based reasoning. They can write eloquent critiques and sound convincing — yet still fail when a task requires verifiable facts. They can’t count words reliably, perform precise calculations, or check complex code constraints. In other words, they can be misled by responses that look superficially correct but are factually wrong.

Imagine asking an LLM to write a poem with at least 350 words. A text-only judge might look at the poem, think “this seems long enough,” and approve it. That impression can be entirely wrong. The example below illustrates the problem vividly.

Figure 1 | A tool-augmented LLM judge runs a simple Python script to get the exact word count (321) and correctly flags that the response falls short of the 350-word constraint, while a text-only judge miscounts and approves it.
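
For concreteness, the kind of verification snippet such a judge emits can be as short as the sketch below (the poem text is a placeholder; the exact code a trained judge writes will vary):

```python
# Minimal sketch of a word-count check a tool-augmented judge might run.
response = "..."  # placeholder for the poem under evaluation

word_count = len(response.split())
meets_constraint = word_count >= 350
print(f"word count = {word_count}, meets 350-word constraint = {meets_constraint}")
```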

This is where the new research paper “Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning” comes in. The authors propose a framework called TIR-Judge, which gives LLM judges a new superpower: the ability to use a code interpreter. By learning to generate and execute code snippets, these judges can verify facts, perform calculations, and make judgments grounded in evidence rather than guesswork. The magic lies in an end-to-end reinforcement learning (RL) process that teaches the judge how and when to use the tool effectively.


The Problem with “Just Trust Me” AI Judges

Reward models and LLM judges are often trained to output a score or preference via textual reasoning. More sophisticated variants generate a “chain-of-thought” or critique before deciding, which helps them reason more coherently — but they still operate inside a text-only box. Without access to real-world tools, they cannot verify what they claim.

Early attempts to add tools revealed severe limitations:

  1. Inference-time only: The tool is added only when making predictions, not during training, preventing full integration between reasoning and execution.
  2. Narrow focus: Many methods target specific domains, such as code evaluation, and fail to generalize across tasks like dialogue or reasoning.

The TIR-Judge framework aims to fix these shortcomings by training judges to interleave reasoning and tool use throughout the learning process, improving both reliability and versatility.


Introducing TIR-Judge: A Judge That Shows Its Work

TIR-Judge rests on three foundational principles:

  1. Integrating tool use with reasoning
  2. Supporting multiple evaluation formats — pointwise, pairwise, and listwise
  3. Training via reinforcement learning to refine tool usage

Figure 2 | The overall TIR-Judge framework: tool-integrated reasoning, flexible judging formats (pointwise, pairwise, listwise), and RL-based training strategies.


1. Thinking with a Code Interpreter

At its core, TIR-Judge works iteratively. When given a prompt and responses to evaluate, the judge doesn’t decide immediately; it interleaves reasoning, coding, and execution over multiple rounds:

  1. Reason: It determines what needs checking (e.g., word count, correctness).
  2. Code: If the task involves verifiable constraints, it writes Python code to test them.
  3. Execute: It runs the code in a controlled sandbox environment.
  4. Observe: The code output (number, boolean, or error) is fed back into its context.
  5. Repeat: Using this evidence, it continues reasoning until reaching a confident verdict.

Figure 3 | The iterative loop of reasoning, code generation, execution, and observation grounds the judge’s decisions in verifiable evidence.

This cyclical interaction between reasoning and tool-use creates judges that can prove their evaluations rather than merely assert them.
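
In pseudo-Python, that loop looks roughly like the sketch below. It illustrates the control flow described above, not the authors’ implementation: llm_generate and run_in_sandbox are hypothetical stand-ins for the judge model and the sandboxed interpreter, and the <verdict> tag is an assumed output convention.

```python
import re

MAX_TOOL_CALLS = 3          # mirrors the tool-call budget used in the reward design
FENCE = "`" * 3             # built dynamically to avoid a literal code fence in this snippet
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
VERDICT_RE = re.compile(r"<verdict>(.*?)</verdict>", re.DOTALL)

def judge(task: str, candidates: list[str], llm_generate, run_in_sandbox) -> str:
    """Sketch of the reason -> code -> execute -> observe -> repeat loop.

    llm_generate(context) and run_in_sandbox(code) are hypothetical callables
    standing in for the judge model and the sandboxed Python interpreter.
    """
    context = task + "\n\n" + "\n\n".join(
        f"[Response {i}]\n{c}" for i, c in enumerate(candidates, start=1)
    )
    for _ in range(MAX_TOOL_CALLS + 1):
        segment = llm_generate(context)                       # Reason
        context += segment
        code = CODE_RE.search(segment)
        if code:                                              # Code
            observation = run_in_sandbox(code.group(1))       # Execute
            context += f"\n<output>{observation}</output>\n"  # Observe
            continue                                          # Repeat
        verdict = VERDICT_RE.search(segment)
        if verdict:
            return verdict.group(1).strip()                   # final, evidence-backed verdict
    return "no verdict"                                       # budget exhausted without a decision
```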


2. Flexible Judging: Pointwise, Pairwise, and Listwise

Real-world evaluation takes many forms, and TIR-Judge accommodates all:

  • Pointwise: Give a numeric score to one response (e.g., “score = 7/10”).
  • Pairwise: Choose between two candidate responses (“Response A is better than Response B”).
  • Listwise: Select the best response from a list of candidates.

Training across these formats ensures TIR-Judge’s adaptability to diverse alignment and evaluation tasks.
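
As a rough illustration of what the three formats look like to downstream code, the verdicts might be parsed into structures like the sketch below. The dataclass fields are illustrative assumptions, not the paper’s schema; only the <score> tag convention comes from the format reward described later.

```python
from dataclasses import dataclass
import re

@dataclass
class PointwiseVerdict:      # one response, one numeric score
    score: int               # e.g. parsed from <score>7</score> tags

@dataclass
class PairwiseVerdict:       # two responses, one winner
    winner: str              # "A" or "B"

@dataclass
class ListwiseVerdict:       # many responses, pick the best
    best_index: int          # index of the winning candidate

def parse_pointwise(judge_output: str) -> PointwiseVerdict:
    # The format reward expects scores inside <score> tags.
    match = re.search(r"<score>\s*(\d+)\s*</score>", judge_output)
    return PointwiseVerdict(score=int(match.group(1)))

print(parse_pointwise("The response is solid overall. <score>7</score>"))
```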


3. Training through Reinforcement Learning

A model won’t naturally know how or when to use code. Reinforcement learning provides a structured path to teach these skills. The model generates evaluation trajectories — combining reasoning and tool-use — and receives rewards based on how accurate and well-formatted its judgments are. Over time, it learns to maximize these rewards.

The reward has three interconnected parts:

  1. Correctness Reward (\(R_c\)): The judge receives a positive reward when its final decision matches the ground truth.

\[
R_c =
\begin{cases}
1, & \text{if the judge's final verdict matches the ground-truth preference} \\
0, & \text{otherwise}
\end{cases}
\]

Figure 4 | The correctness reward assigns a score of 1 when the judge’s output aligns with the correct preference.

  2. Format Reward (\(R_f\)): Enforces structured output. For example, scores must appear within <score> tags, and code must reside inside fenced Python blocks. For non-verifiable tasks such as helpfulness or safety assessments, the judge earns full reward only if it avoids unnecessary tool calls.

  3. Tool-Specific Reward (\(R_t\)): Encourages sound coding practices by penalizing execution errors or excessive tool invocations. A trajectory earns full credit only if all code runs successfully and stays within a limit of three tool calls.

Together, these components form the composite reward:

Figure 5 | The total reward multiplies the correctness reward \(R_c\) by a factor that rewards proper formatting and efficient tool use, so only precise, well-structured judgments score highly.

The model learns that accuracy alone isn’t enough; correctness, formatting discipline, and efficient tool use are all required for success.
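
A hedged sketch of how such a composite reward could be computed is shown below. The individual checks follow the descriptions above, but the parsing conventions, the exact weighting, and the combination rule are illustrative assumptions rather than the paper’s formula.

```python
import re

MAX_TOOL_CALLS = 3

def correctness_reward(predicted_verdict: str, ground_truth: str) -> float:
    # R_c: 1 if the final decision matches the ground-truth preference, else 0.
    return 1.0 if predicted_verdict.strip() == ground_truth.strip() else 0.0

def format_reward(trajectory: str, task_is_verifiable: bool) -> float:
    # R_f: the score must appear inside <score> tags; on non-verifiable tasks
    # (e.g. helpfulness, safety) full reward also requires no tool calls.
    has_score = bool(re.search(r"<score>.*?</score>", trajectory, re.DOTALL))
    used_tool = "<output>" in trajectory      # assumed marker for tool results
    if not has_score:
        return 0.0
    if not task_is_verifiable and used_tool:
        return 0.0
    return 1.0

def tool_reward(num_tool_calls: int, num_execution_errors: int) -> float:
    # R_t: full credit only if every snippet ran without error and the
    # trajectory stayed within the three-call budget.
    if num_execution_errors > 0 or num_tool_calls > MAX_TOOL_CALLS:
        return 0.0
    return 1.0

def total_reward(predicted_verdict: str, ground_truth: str, trajectory: str,
                 task_is_verifiable: bool, num_tool_calls: int,
                 num_execution_errors: int) -> float:
    # Assumed combination: correctness gated by format and tool quality.
    r_c = correctness_reward(predicted_verdict, ground_truth)
    r_f = format_reward(trajectory, task_is_verifiable)
    r_t = tool_reward(num_tool_calls, num_execution_errors)
    return r_c * (r_f + r_t) / 2.0   # illustrative weighting only
```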


Overcoming the Cold-Start Problem: Two Training Paths

A fresh LLM lacks structured reasoning and tool-use skills. The authors propose two complementary initialization schemes:

1. TIR-Judge-Distill: This variant uses a strong teacher model (Gemini 2.5 Flash with code execution) to generate high-quality judgment trajectories. Only the correct, well-formatted examples are kept. The smaller student model is then fine-tuned (SFT) on these examples before undergoing RL, ensuring a smooth ramp-up.

2. TIR-Judge-Zero: Can a model learn without any teacher data? Surprisingly, yes. TIR-Judge-Zero bootstraps itself through cycles of RL → Rejection Sampling → SFT:

Figure 6 | TIR-Judge-Zero self-improves through iterative cycles of reinforcement learning (RL), rejection sampling (RS), and supervised fine-tuning (SFT).

Each cycle refines the model’s reasoning and coding ability using its own best trajectories, steadily improving without external teacher data.
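
One iteration of that cycle might be organized along the lines of the sketch below. Every callable here (run_rl, sample_trajectories, is_correct_and_well_formatted, supervised_finetune) is a hypothetical placeholder for the corresponding training stage; only the control flow is meant to be illustrative.

```python
def tir_judge_zero_cycle(model, train_tasks, run_rl, sample_trajectories,
                         is_correct_and_well_formatted, supervised_finetune,
                         num_iterations=3):
    """Sketch of the RL -> rejection sampling -> SFT loop behind TIR-Judge-Zero."""
    for _ in range(num_iterations):
        # 1) RL: optimize the judge against the composite reward described earlier.
        model = run_rl(model, train_tasks)

        # 2) Rejection sampling: generate trajectories with the current model and
        #    keep only those that are correct and well formatted.
        kept = [
            trajectory
            for task in train_tasks
            for trajectory in sample_trajectories(model, task)
            if is_correct_and_well_formatted(trajectory, task)
        ]

        # 3) SFT: fine-tune the model on its own best trajectories, then repeat.
        model = supervised_finetune(model, kept)
    return model
```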


Putting TIR-Judge to the Test

The team evaluated TIR-Judge on seven public benchmarks covering reasoning, instruction-following, and code evaluation. Comparisons included leading proprietary judges (GPT‑4o, Claude 3.5) and open-source reward models.

Table 1 | Pointwise and pairwise results: TIR-Judge consistently outperforms similarly sized reasoning judges, with strong results even in non-verifiable domains.

Key findings:

  • Superior Accuracy: Gains up to 6.4% (pointwise) and 7.7% (pairwise) over strong reasoning baselines.
  • RL Matters: Simply adding a code interpreter without RL yields negligible improvement, proving that learning how to use the tool is vital.
  • Self-Improvement Works: TIR-Judge-Zero, trained with no teacher supervision, often surpasses its distilled counterpart — achieving autonomous self-bootstrapping.

In the more demanding listwise setting, the 8B TIR-Judge reached 96% of Claude-Opus‑4’s performance, despite being much smaller.

Table 2 | Listwise results on RewardBench2: TIR-Judge-Zero 8B nearly matches top proprietary judges such as Claude-Opus‑4 and Gemini‑2.5‑Flash, demonstrating exceptional efficiency.


What Drives TIR-Judge’s Success

Diverse Data Mixtures

Training with both verifiable (math, code) and non-verifiable (chat, safety) tasks yields robust generalization. Single-domain training leads to weak cross-domain performance.

Figure 7 | Combining instruction-following, coding, reasoning, and helpfulness/safety data (tool-use and text-only tasks alike) produces the most well-rounded judges.

Iterative Reinforcement Learning Improves Performance

Each RL iteration visibly enhances accuracy and reasoning efficiency.

Figure 8 | TIR-Judge-Zero’s accuracy climbs steadily over successive RL rounds, confirming the power of iterative self-improvement.

Efficient and Fast

Despite using external code execution, TIR-Judge runs faster than many text-only judges. Reinforcement learning rewards shorter, cleaner reasoning trajectories and fewer tool calls, reducing inference overhead.

Figure 9 | Accuracy versus inference speed: TIR-Judge variants (orange X’s) sit in the top-right of the plot, achieving both higher accuracy and faster responses than traditional reasoning judges (green circles).


Case Study: When “Showing the Work” Matters

A compelling example from the IFEval benchmark demonstrates TIR-Judge’s advantage. The prompt asks for a letter written entirely in uppercase, where the letter “O” appears at least 40 times.

Table 3 | TIR-Judge writes and executes Python to count occurrences of “O” exactly and verify capitalization, while the text-only judge hallucinates counts and reaches the wrong verdict.

The text-only judge attempts manual counting and guesses incorrectly. TIR-Judge-Zero, meanwhile, writes a simple script to count “O”s and verify capitalization. The execution reveals exact counts: Response A has 58 “O”s and is fully capitalized, while Response B fails the capitalization requirement. With factual proof in hand, TIR-Judge chooses correctly.
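
The verification code for this constraint is short and deterministic, which is exactly the point. A minimal sketch of the kind of script the judge emits (the response strings are placeholders for the actual candidate letters):

```python
# Sketch of the check the judge can run; the response strings are placeholders.
response_a = "DEAR FRIEND, ..."   # candidate letter A (placeholder)
response_b = "Dear friend, ..."   # candidate letter B (placeholder)

for name, text in [("A", response_a), ("B", response_b)]:
    o_count = text.count("O")          # exact count of the uppercase letter "O"
    all_caps = text.upper() == text    # True if the text contains no lowercase letters
    print(f"Response {name}: O count = {o_count}, fully uppercase = {all_caps}, "
          f"meets constraints = {o_count >= 40 and all_caps}")
```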

This example encapsulates the framework’s essence: replace intuition with verification.


Conclusion: A Step Toward More Trustworthy AI

TIR-Judge redefines how large language models evaluate others — and themselves. By blending structured reasoning with real tool-use and optimizing through reinforcement learning, it delivers more precise, transparent, and efficient evaluations.

The standout contribution, TIR-Judge-Zero, demonstrates that models can bootstrap complex reasoning and tool-use capabilities autonomously, without relying on stronger proprietary teachers. This ushers in a scalable future where AI systems refine themselves through continual, verifiable learning.

As LLMs become ubiquitous, trust and factuality are paramount. Tool-integrated judges like TIR-Judge pave the way toward a future where we don’t just take the AI’s word — we can ask it to show the proof.