How do we know if a Large Language Model (LLM) is doing a good job? This seemingly simple question is one of the hardest in modern AI. While humans can grade LLM responses, that process is slow, expensive, and impossible to scale. A promising alternative is to let another LLM serve as a judge, evaluating its peers. This “LLM-as-a-Judge” approach is quickly becoming essential for training, aligning, and evaluating language models.
But there’s a serious flaw in this system. Most LLM judges rely only on their own internal, text-based reasoning. They can write eloquent critiques and sound convincing — yet still fail when a task requires verifiable facts. They can’t count words reliably, perform precise calculations, or check complex code constraints. In other words, they can be misled by responses that look superficially correct but are factually wrong.
Imagine asking an LLM to write a poem with at least 350 words. A text-only judge might look at the poem, think “this seems long enough,” and approve it. But that impression can be entirely wrong. The example below illustrates the challenge vividly.

Figure 1 | A tool-augmented LLM judge uses a simple Python script to get the exact word count (321), correctly identifying that the response fails the 350-word constraint. A text-only judge may miscount and approve the response.
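To see how simple such a check is, here is a minimal sketch of the verification a tool-augmented judge can run; the poem text and the way the result is reported are illustrative placeholders, not the paper's actual example.

```python
# Minimal sketch: verify a word-count constraint instead of eyeballing it.
# The poem text here is a placeholder standing in for the candidate response.
poem = "Beneath the amber sky the rivers run ..."
MIN_WORDS = 350

word_count = len(poem.split())
print(f"word count = {word_count}")
print("PASS" if word_count >= MIN_WORDS else f"FAIL: fewer than {MIN_WORDS} words")
```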
This is where the new research paper “Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning” comes in. The authors propose a framework called TIR-Judge, which gives LLM judges a new superpower: the ability to use a code interpreter. By learning to generate and execute code snippets, these judges can verify facts, perform calculations, and make judgments grounded in evidence rather than guesswork. The magic lies in an end-to-end reinforcement learning (RL) process that teaches the judge how and when to use the tool effectively.
The Problem with “Just Trust Me” AI Judges
Reward models and LLM judges are often trained to output a score or preference via textual reasoning. More sophisticated variants generate a “chain-of-thought” or critique before deciding, which helps them reason more coherently — but they still operate inside a text-only box. Without access to real-world tools, they cannot verify what they claim.
Early attempts to add tools revealed severe limitations:
- Inference-time only: The tool is added only when making predictions, not during training, preventing full integration between reasoning and execution.
- Narrow focus: Many methods target specific domains, such as code evaluation, and fail to generalize across tasks like dialogue or reasoning.
The TIR-Judge framework aims to fix these shortcomings by training judges to interleave reasoning and tool use throughout the learning process, improving both reliability and versatility.
Introducing TIR-Judge: A Judge That Shows Its Work
TIR-Judge rests on three foundational principles:
- Integrating tool use with reasoning
- Supporting multiple evaluation formats — pointwise, pairwise, and listwise
- Training via reinforcement learning to refine tool usage

Figure 2 | TIR-Judge combines tool-integrated reasoning, flexible judging formats, and robust RL-based training strategies.
1. Thinking with a Code Interpreter
At its core, TIR-Judge works iteratively. When given a prompt and responses to evaluate, the judge doesn’t just decide immediately; it works through multiple rounds of reasoning, coding, and execution:
- Reason: It determines what needs checking (e.g., word count, correctness).
- Code: If the task involves verifiable constraints, it writes Python code to test them.
- Execute: It runs the code in a controlled sandbox environment.
- Observe: The code output (number, boolean, or error) is fed back into its context.
- Repeat: Using this evidence, it continues reasoning until reaching a confident verdict.

Figure 3 | The iterative process of reasoning, coding, execution, and observation enables the judge to ground decisions in verifiable evidence.
This cyclical interaction between reasoning and tool-use creates judges that can prove their evaluations rather than merely assert them.
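A minimal sketch of that loop is shown below. The helpers `generate_step` and `run_in_sandbox`, and the way a trajectory ends, are assumptions made for illustration rather than the paper's actual interface; the three-call cap mirrors the tool-call limit described later.

```python
# Illustrative sketch of the reason -> code -> execute -> observe loop.
# `generate_step` and `run_in_sandbox` are hypothetical helpers, not the paper's API.

MAX_TOOL_CALLS = 3  # the paper caps the number of tool invocations per trajectory

def judge(prompt, responses, generate_step, run_in_sandbox):
    context = {"prompt": prompt, "responses": responses, "observations": []}
    for _ in range(MAX_TOOL_CALLS):
        step = generate_step(context)               # reason over everything seen so far
        if step["kind"] == "code":
            result = run_in_sandbox(step["code"])   # execute the snippet in isolation
            context["observations"].append(result)  # feed the output back as evidence
        else:                                       # step["kind"] == "verdict"
            return step["verdict"]                  # confident final judgment
    return generate_step(context)["verdict"]        # final verdict once the call budget is spent
```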
2. Flexible Judging: Pointwise, Pairwise, and Listwise
Real-world evaluation takes many forms, and TIR-Judge accommodates all:
- Pointwise: Give a numeric score to one response (e.g., “score = 7/10”).
- Pairwise: Choose between two candidate responses (“Response A is better than Response B”).
- Listwise: Select the best response from a list of candidates.
Training across these formats ensures TIR-Judge’s adaptability to diverse alignment and evaluation tasks.
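Concretely, the three formats mostly differ in what the judge is asked to emit at the end of its trajectory. The tag names below are illustrative assumptions, not the paper's exact output templates.

```python
# Illustrative final outputs for each judging format (tag names are assumptions).
pointwise_output = "<score>7</score>"            # numeric score for a single response
pairwise_output = "<preference>A</preference>"   # winner between two candidate responses
listwise_output = "<best>3</best>"               # index of the best response in a candidate list
```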
3. Training through Reinforcement Learning
A model won’t naturally know how or when to use code. Reinforcement learning provides a structured path to teach these skills. The model generates evaluation trajectories — combining reasoning and tool-use — and receives rewards based on how accurate and well-formatted its judgments are. Over time, it learns to maximize these rewards.
The reward has three interconnected parts:
- Correctness Reward (\(R_c\)): The judge receives a positive reward when its final decision matches the ground truth.

Figure 4 | The correctness reward assigns a score of 1 when the judge’s output aligns with the correct preference.
- Format Reward (\(R_f\)): Enforces structured output. For example, scores must appear within <score> tags, and code must reside inside fenced `python` blocks. For non-verifiable tasks such as helpfulness or safety assessments, the judge earns full reward only if it avoids unnecessary tool calls.
- Tool-Specific Reward (\(R_t\)): Encourages sound coding practices by penalizing execution errors or excessive tool invocations. A trajectory earns full credit only if all code runs successfully, within a limit of three tool calls.
Together, these components form the composite reward:

Figure 5 | The total reward combines correctness, format, and tool-efficiency — rewarding precise, well-structured reasoning.
The model learns that accuracy alone isn’t enough; correctness, formatting discipline, and efficient tool use are all required for success.
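A minimal sketch of how these reward components might be computed and combined is below. The tag convention, the all-or-nothing scoring, and the simple sum at the end are assumptions for illustration; the paper's exact formulation is the one shown in Figure 5.

```python
import re

MAX_TOOL_CALLS = 3  # the paper limits trajectories to three tool calls

def correctness_reward(verdict: str, ground_truth: str) -> float:
    """Sketch of R_c: 1 if the final decision matches the ground-truth preference."""
    return 1.0 if verdict == ground_truth else 0.0

def format_reward(output: str, task_is_verifiable: bool, num_tool_calls: int) -> float:
    """Sketch of R_f: reward structured output; penalize unneeded tool calls on non-verifiable tasks."""
    has_score_tag = re.search(r"<score>.*?</score>", output, re.DOTALL) is not None
    if not task_is_verifiable and num_tool_calls > 0:
        return 0.0
    return 1.0 if has_score_tag else 0.0

def tool_reward(execution_errors: int, num_tool_calls: int) -> float:
    """Sketch of R_t: full credit only if all code ran cleanly within the call budget."""
    return 1.0 if execution_errors == 0 and num_tool_calls <= MAX_TOOL_CALLS else 0.0

def total_reward(verdict, ground_truth, output, task_is_verifiable, num_tool_calls, execution_errors):
    # A simple sum is an assumption; the paper's Figure 5 gives the exact combination.
    return (correctness_reward(verdict, ground_truth)
            + format_reward(output, task_is_verifiable, num_tool_calls)
            + tool_reward(execution_errors, num_tool_calls))
```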
Overcoming the Cold-Start Problem: Two Training Paths
A fresh LLM lacks structured reasoning and tool-use skills. The authors propose two complementary initialization schemes:
1. TIR-Judge-Distill: This variant uses a strong teacher model (Gemini 2.5 Flash with code execution) to generate high-quality judgment trajectories. Only the correct, well-formatted examples are kept. The smaller student model is then fine-tuned (SFT) on these examples before undergoing RL, ensuring a smooth ramp-up.
2. TIR-Judge-Zero: Can a model learn without supervision? Surprisingly, yes. TIR-Judge-Zero bootstraps itself through cycles of RL → Rejection Sampling → SFT:

Figure 6 | TIR-Judge-Zero self-improves through iterative cycles of reinforcement learning, rejection sampling, and supervised fine-tuning.
Each cycle refines the model’s reasoning and coding ability using its own best trajectories, steadily improving without external teacher data.
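In pseudocode, the self-bootstrapping cycle looks roughly like the sketch below; `run_rl`, `sample_trajectories`, `filter_correct`, and `sft` are hypothetical placeholders for the corresponding stages, not the paper's implementation.

```python
# Illustrative sketch of TIR-Judge-Zero's loop: RL -> rejection sampling -> SFT.
# All helper functions are hypothetical placeholders for the stages named in the text.

def tir_judge_zero(model, prompts, num_iterations, run_rl, sample_trajectories, filter_correct, sft):
    for _ in range(num_iterations):
        model = run_rl(model, prompts)                      # RL with the composite reward
        trajectories = sample_trajectories(model, prompts)  # generate judgment trajectories
        best = filter_correct(trajectories)                 # rejection sampling: keep correct, well-formatted ones
        model = sft(model, best)                            # fine-tune on the model's own best trajectories
    return model
```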
Putting TIR-Judge to the Test
The team evaluated TIR-Judge on seven public benchmarks covering reasoning, instruction-following, and code evaluation. Comparisons included leading proprietary judges (GPT‑4o, Claude 3.5) and open-source reward models.

Table 1 | TIR-Judge consistently outperforms similarly sized reasoning judges, with strong results even in non-verifiable domains.
Key findings:
- Superior Accuracy: Gains up to 6.4% (pointwise) and 7.7% (pairwise) over strong reasoning baselines.
- RL Matters: Simply adding a code interpreter without RL yields negligible improvement, proving that learning how to use the tool is vital.
- Self-Improvement Works: TIR-Judge-Zero, trained with no teacher supervision, often surpasses its distilled counterpart — achieving autonomous self-bootstrapping.
In the more demanding listwise setting, the 8B TIR-Judge reached 96% of Claude-Opus‑4’s performance, despite being much smaller.

Table 2 | TIR-Judge-Zero 8B nearly matches top proprietary judges such as Claude-Opus‑4 and Gemini‑2.5‑Flash, demonstrating exceptional efficiency.
What Drives TIR-Judge’s Success
Diverse Data Mixtures
Training with both verifiable (math, code) and non-verifiable (chat, safety) tasks yields robust generalization. Single-domain training leads to weak cross-domain performance.

Figure 7 | Diverse training data covering tool-use and text-only tasks produces well-rounded judges.
Iterative Reinforcement Learning Improves Performance
Each RL iteration visibly enhances accuracy and reasoning efficiency.

Figure 8 | TIR-Judge-Zero accuracy steadily climbs through iterative RL cycles, confirming the power of self-improvement.
Efficient and Fast
Despite using external code execution, TIR-Judge runs faster than many text-only judges. Reinforcement learning rewards shorter, cleaner reasoning trajectories and fewer tool calls, reducing inference overhead.

Figure 9 | TIR-Judge variants (orange X’s) occupy the top-right—higher accuracy and faster responses—compared to traditional reasoning judges (green circles).
Case Study: When “Showing the Work” Matters
A compelling example from the IFEval benchmark demonstrates TIR-Judge’s advantage. The prompt asks for a letter written entirely in uppercase, where the letter “O” appears at least 40 times.

Table 3 | TIR-Judge executes Python to count letters correctly, while the text-only judge hallucinates counts and makes the wrong decision.
The text-only judge attempts manual counting and guesses incorrectly. TIR-Judge-Zero, meanwhile, writes a simple script to count the “O”s and verify capitalization. The execution reveals the exact counts: Response A contains 58 “O”s and is fully capitalized, while Response B fails the capitalization requirement. With factual proof in hand, TIR-Judge chooses correctly.
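The verification itself takes only a few lines. The letter text below is a placeholder for the benchmark response; the 40-“O” threshold comes from the prompt.

```python
# Sketch of the check TIR-Judge runs for the IFEval example.
# `response_a` is a placeholder; the real letters come from the benchmark.
response_a = "DEAR FRIEND, ..."

o_count = response_a.count("O")              # the letter "O" must appear at least 40 times
all_caps = response_a == response_a.upper()  # the entire letter must be uppercase
print(f"O count = {o_count}, all caps = {all_caps}")
print("PASS" if o_count >= 40 and all_caps else "FAIL")
```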
This experiment encapsulates the framework’s essence: replace intuition with verification.
Conclusion: A Step Toward More Trustworthy AI
TIR-Judge redefines how large language models evaluate others — and themselves. By blending structured reasoning with real tool-use and optimizing through reinforcement learning, it delivers more precise, transparent, and efficient evaluations.
The standout contribution, TIR-Judge-Zero, demonstrates that models can bootstrap complex reasoning and tool-use capabilities autonomously, without relying on stronger proprietary teachers. This ushers in a scalable future where AI systems refine themselves through continual, verifiable learning.
As LLMs become ubiquitous, trust and factuality are paramount. Tool-integrated judges like TIR-Judge pave the way toward a future where we don’t just take the AI’s word — we can ask it to show the proof.