Introduction
In the rapidly evolving world of Artificial Intelligence, we have reached a fascinating inflection point: we are now using AI to grade AI. As Large Language Models (LLMs) like GPT-4, Claude, and Llama become increasingly sophisticated, human evaluation has become a bottleneck. It is slow, expensive, and difficult to scale. Consequently, researchers have turned to “LLM-as-a-Judge”—using powerful models to evaluate the responses of other models.
However, just like human judges, AI judges have biases. One of the most pervasive and troublesome issues is position bias. When presented with two candidate answers (Answer A and Answer B), an LLM evaluator will often prefer the first one simply because it appeared first, or the second one because it was the last thing it read. This has nothing to do with the quality of the answer and everything to do with the model’s internal processing quirks.
If an evaluator prefers “Answer A” when it is presented first, but then switches its preference to “Answer B” just because you swapped their order, the evaluator is inconsistent. This inconsistency renders the evaluation unreliable.
In this deep dive, we will explore a research paper that proposes a novel solution to this problem: PORTIA. Developed by researchers from Hong Kong University of Science and Technology and other institutions, PORTIA (named after the clever judge in Shakespeare’s The Merchant of Venice) uses a “Split and Merge” strategy to mathematically and semantically align answers. By breaking answers down into segments and reassembling them, PORTIA forces the LLM to evaluate content fairly, drastically reducing bias and allowing cheaper, faster models to perform as well as state-of-the-art giants.
The Problem: Position Bias in Pairwise Comparison
To understand the solution, we must first understand the specific evaluation paradigm being used. The paper focuses on pairwise comparison, the most common method for ranking LLM outputs.
In a pairwise comparison, the evaluator is given a user question and two responses: Response A and Response B. The evaluator must decide which response is better, or if it is a tie.
Ideally, a judge should be robust to permutation. If Response A is better than Response B, the judge should say “A is better” regardless of whether A is shown first or second. Mathematically, if \(V\) is the verdict function that returns the winning response:
\[V(q, r_1, r_2) = V(q, r_2, r_1)\]
However, in reality, LLMs fail this test frequently.
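To make this concrete, here is a minimal sketch of the permutation test in Python. The `judge` callable is a stand-in for any LLM evaluator (not an API from the paper) and is assumed to return "A", "B", or "tie":

```python
from typing import Callable

Verdict = str  # "A", "B", or "tie"

def is_position_consistent(judge: Callable[[str, str, str], Verdict],
                           question: str, r1: str, r2: str) -> bool:
    """Run the judge twice with the responses swapped and check that
    the same underlying response wins both times."""
    v_orig = judge(question, r1, r2)  # r1 is shown as "A", r2 as "B"
    v_flip = judge(question, r2, r1)  # r2 is shown as "A", r1 as "B"
    remap = {"A": "B", "B": "A", "tie": "tie"}  # undo the relabeling
    return v_orig == remap[v_flip]
```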

As shown in Figure 1 (Left), a standard evaluation might declare the first assistant the winner. But if you flip the order, it might declare the new first assistant the winner. This is a “failure of consistency.” The image on the right shows the desired outcome: regardless of the input order, the verdict remains stable.
The researchers note that this bias isn’t just a quirk of smaller models; even GPT-4 exhibits it. The consequence is that developers are forced to use the most expensive models (like GPT-4) because they are less biased than others, making automated evaluation costly. If we could fix this bias, we could potentially use cheaper, faster models (like GPT-3.5 or Llama-2) to achieve GPT-4 level reliability.
The Inspiration: How Humans Read
Why does PORTIA work? The design is inspired by cognitive science regarding how humans process long texts. When a human evaluator is faced with two long, complex essays to compare, they rarely hold the entirety of both texts in their working memory at once. Instead, they use a decomposition strategy.
Humans naturally break information into smaller chunks. We might compare the introduction of Essay A with the introduction of Essay B. Then, we compare the methodology section of A with the methodology of B. By comparing aligned segments, we reduce cognitive load and make fairer assessments.
PORTIA attempts to mimic this human strategy programmatically. Instead of feeding the LLM two massive blocks of text, it:
- Splits the answers into smaller segments.
- Aligns those segments (so we are comparing apples to apples).
- Merges them into a single, structured prompt for the LLM.
The Core Method: The PORTIA System
The PORTIA system is an alignment-based framework designed to calibrate position bias without needing to retrain or fine-tune the LLM evaluator itself. This makes it a “lightweight” solution that can be applied to any model, proprietary or open-source.
The Pipeline
The overall workflow of PORTIA is distinct from a standard evaluation pipeline. In a standard flow, you simply feed the question and two answers to the LLM. In PORTIA, the data goes through a preprocessing stage involving “Recognition,” “Semantic Alignment,” and “Length Alignment” before reaching the evaluator.

As illustrated in Figure 4, the “Flow after fixing” (green arrows) introduces these critical alignment steps. If the LLM output is still messy (e.g., it doesn’t give a clear score), PORTIA also includes a formatting extractor to ensure a valid verdict is recorded.
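The staged fallback can be sketched as plain control flow. Everything below is a simplified skeleton, not the paper's code; the three callables are hypothetical stand-ins for the alignment and judging components:

```python
from typing import Callable, Tuple

def portia_flow(question: str, r1: str, r2: str,
                split_len: Callable[[str, int], list[str]],
                split_sem: Callable[[str, str, int], Tuple[list[str], list[str]]],
                judge_merged: Callable[[str, list[str], list[str]], Tuple[str, bool]],
                k: int = 3) -> str:
    """Try cheap length alignment first; fall back to the costlier
    semantic alignment only if the verdict flips with answer order."""
    segs1, segs2 = split_len(r1, k), split_len(r2, k)
    verdict, consistent = judge_merged(question, segs1, segs2)
    if consistent:
        return verdict
    segs1, segs2 = split_sem(r1, r2, k)  # iterative similarity search
    verdict, _ = judge_merged(question, segs1, segs2)
    return verdict
```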
The Splitting Algorithm
The heart of the paper is the splitting algorithm. The goal is to divide two differing answers, \(r_1\) and \(r_2\), into \(k\) segments such that the segments are comparable.
There are two major constraints the authors had to respect:
- Content Preservation: The split segments must contain all the information from the original answer. Nothing can be deleted or added.
- Order Preservation: The flow of the argument must remain the same. You cannot scramble the sentences.
The algorithm operates in three phases, moving from simple to complex strategies.
Phase 1: Identifying Sentence Boundaries
You cannot simply chop a text after every 100 characters, or you might split a word in half. PORTIA first parses the text to find sentence boundaries (periods, question marks, etc.). For code blocks, it uses a parser (Tree-sitter) to ensure it doesn’t break a line of code that relies on syntax like indentation.
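For plain prose, boundary detection can be as simple as a regular expression. This sketch handles only prose and omits the Tree-sitter handling of code blocks:

```python
import re

def sentence_boundaries(text: str) -> list[int]:
    """Character offsets just past '.', '!', or '?' when followed by
    whitespace (or end of text) -- candidate split points for prose."""
    return [m.end() for m in re.finditer(r"[.!?](?=\s|$)", text)]
```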
Phase 2: Length Alignment
The fastest way to split text is by length. PORTIA attempts to divide the answer into \(k\) segments of roughly equal character counts.

Figure 5 visualizes this process; a code sketch follows the list.
- The system calculates where the split should be mathematically to get equal lengths (the red markers).
- It looks for the nearest valid sentence boundary (the purple lines).
- It selects the best candidate (the cyan markers).
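A minimal Python sketch of this snapping logic, reusing the simple regex boundary detector from above (PORTIA's actual boundary detection is richer):

```python
import re

def split_by_length(text: str, k: int = 3) -> list[str]:
    """Split text into k segments of roughly equal length, snapping
    each ideal cut point to the nearest sentence boundary."""
    bounds = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", text)]
    if not bounds:
        return [text]
    # Ideal cuts at len/k, 2*len/k, ... snapped to the closest boundary.
    cuts = sorted({min(bounds, key=lambda b: abs(b - round(i * len(text) / k)))
                   for i in range(1, k)})
    pieces, start = [], 0
    for cut in cuts + [len(text)]:
        pieces.append(text[start:cut].strip())
        start = cut
    return [p for p in pieces if p]
```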
Once the text is split, PORTIA constructs a prompt where the segments are interleaved or presented in a structured way, and asks the LLM for a verdict. It checks consistency by flipping the order. If the LLM gives the same winner both times, the job is done.
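One plausible way to lay out the merged prompt is shown below; the template wording is illustrative, not the paper's exact prompt:

```python
def build_merged_prompt(question: str, segs_a: list[str], segs_b: list[str]) -> str:
    """Interleave aligned segments so the judge reads Part i of both
    answers side by side before giving one overall verdict."""
    lines = [f"Question: {question}", ""]
    for i, (sa, sb) in enumerate(zip(segs_a, segs_b), start=1):
        lines += [f"Part {i} of Assistant A's answer:", sa, "",
                  f"Part {i} of Assistant B's answer:", sb, ""]
    lines.append("Which assistant answered better overall? Reply A, B, or tie.")
    return "\n".join(lines)
```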
Phase 3: Semantic Alignment
What if the answers have similar lengths but totally different structures? For example, Answer A might have a long intro and a short conclusion, while Answer B has a short intro and a detailed conclusion. Splitting strictly by length might align Answer A’s intro with Answer B’s conclusion—a mismatch that confuses the evaluator.
If Length Alignment fails to produce a consistent verdict, PORTIA moves to Semantic Alignment.
This is an iterative search process. The algorithm tries to find split points in both answers that maximize the semantic similarity between the resulting segments. It uses metrics like token overlap or Sentence-BERT embeddings to calculate how similar Segment 1 of Answer A is to Segment 1 of Answer B.
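The lightweight end of that spectrum, token overlap, fits in a few lines (Jaccard similarity over word sets; Sentence-BERT embeddings would replace this for the heavier variant):

```python
def token_overlap(seg_a: str, seg_b: str) -> float:
    """Jaccard overlap between the word sets of two segments."""
    ta, tb = set(seg_a.lower().split()), set(seg_b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
```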

The pseudocode in Algorithm 1 details this logic.
- Lines 1-2: Identify valid split positions (formatting).
- Lines 3-7 (Length Alignment): Try splitting by equal length. If the verdict is consistent (\(v = v_{\text{flip}}\)), return the result.
- Lines 8-18 (Semantic Alignment): If inconsistent, enter a loop. It iterates through possible partitions, calculating a cumulative similarity score (\(s_{cum}\)) between the segments. It keeps the split configuration that maximizes this similarity (\(s_{max}\)).
- Lines 19-21: Re-evaluate using the semantically aligned segments.
By aligning semantically, PORTIA ensures that the LLM is comparing the “Intro to the Intro” and the “Conclusion to the Conclusion,” mimicking human reading strategies.
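A brute-force rendering of the search in Lines 8-18, assuming precomputed sentence-boundary lists and the token-overlap similarity sketched earlier (the paper's algorithm is more careful; this only conveys the idea):

```python
from itertools import combinations

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def segments(text: str, cuts: tuple[int, ...]) -> list[str]:
    points = [0, *cuts, len(text)]
    return [text[points[i]:points[i + 1]] for i in range(len(points) - 1)]

def split_by_semantics(r1: str, r2: str, bounds1: list[int],
                       bounds2: list[int], k: int = 3):
    """Search all (k-1)-subsets of each answer's sentence boundaries and
    keep the pair of splits maximizing cumulative segment similarity."""
    best, s_max = None, -1.0
    for c1 in combinations(bounds1, k - 1):
        for c2 in combinations(bounds2, k - 1):
            segs1, segs2 = segments(r1, c1), segments(r2, c2)
            s_cum = sum(token_overlap(a, b) for a, b in zip(segs1, segs2))
            if s_cum > s_max:
                s_max, best = s_cum, (segs1, segs2)
    return best
```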
Experimental Setup
To validate this system, the researchers conducted extensive experiments.
- Dataset: MT-Bench, a high-quality set of open-ended questions across categories like Coding, Reasoning, and Roleplay.
- Evaluators (The Judges): They tested six LLMs: GPT-4, GPT-3.5, Claude-2, Qwen, ChatGLM2, and Llama-2.
- Comparison Forms: They tested three ways of asking the LLM to judge:
  - Relation-based: “Who is better? A, B, or Tie?”
  - Score-based: “Score A and B from 1-10.”
  - Likert-based: “Rate preference on a 1-5 scale.”
The metric for success was Consistency Rate: The percentage of times the LLM gave the same verdict when the answer order was swapped.
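A small helper for this metric, assuming verdicts are recorded as "A", "B", or "tie" for both orderings:

```python
def consistency_rate(verdict_pairs: list[tuple[str, str]]) -> float:
    """Fraction of cases where the same underlying answer wins before
    and after swapping. In the flipped run the labels are reversed, so
    an original 'A' verdict must reappear as 'B' to count."""
    remap = {"A": "B", "B": "A", "tie": "tie"}
    hits = sum(1 for v, v_flip in verdict_pairs if v == remap[v_flip])
    return hits / len(verdict_pairs)
```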
Results: A Massive Boost in Consistency
The results were compelling. PORTIA significantly improved the consistency of every model tested.

Table 1 presents the main findings. Let’s look at a few highlights:
- Claude-2: Originally, it had a dismal consistency rate of 28.28% in relation-based comparisons. With PORTIA, that jumped to 83.28%—a relative improvement of nearly 200%.
- Llama-2: An open-source local model. It improved from 36.41% to 68.75%.
- GPT-4: Even the state-of-the-art model improved. It started high at 93.44% but reached near-perfection at 97.03%.
The column “% Fixed Coverage” is particularly interesting. It tells us what percentage of the originally inconsistent cases were fixed by PORTIA. For GPT-3.5 on Likert-based tasks, PORTIA fixed 96.32% of the errors.
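Fixed coverage is straightforward to compute from per-case consistency flags recorded before and after applying PORTIA (a sketch under that assumption):

```python
def fixed_coverage(before: list[bool], after: list[bool]) -> float:
    """Of the cases inconsistent before PORTIA (before[i] is False),
    the fraction that are consistent after (after[i] is True)."""
    broken = [i for i, ok in enumerate(before) if not ok]
    return sum(after[i] for i in broken) / len(broken) if broken else 1.0
```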
Efficiency and Cost Analysis
One might worry that splitting answers and running alignment algorithms would be computationally expensive. However, the researchers analyzed this trade-off.

Figure 2 shows the cost analysis.
- Graph (a): As you increase \(k\) (the number of splits), the input token length increases linearly because you are adding more prompt instructions for each segment.
- Graph (b): The computation required to find the optimal split (Semantic Alignment) grows exponentially with \(k\).
Because of this exponential growth, the researchers found that setting \(k=3\) was the sweet spot. It provides enough granularity to align the text without causing a computational explosion.
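A back-of-the-envelope illustration of why the search space blows up, assuming the semantic search enumerates every choice of \(k-1\) split points per answer from \(n\) candidate boundaries, with both answers searched jointly:

```python
from math import comb

n = 20  # hypothetical number of sentence boundaries per answer
for k in (2, 3, 4, 5):
    per_answer = comb(n, k - 1)  # ways to place k-1 split points
    joint = per_answer ** 2      # both answers searched together
    print(f"k={k}: {per_answer:>5} splits per answer, {joint:>10} joint candidates")
```

With \(n = 20\), the joint candidate count climbs from 400 at \(k=2\) to over 23 million at \(k=5\), which is why \(k=3\) is a pragmatic ceiling.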
The Real-World Payoff: The most significant efficiency finding is monetary. By using PORTIA, the cheaper GPT-3.5 model achieved an agreement rate with humans (and GPT-4) that rivaled the standalone GPT-4.
Using PORTIA + GPT-3.5 costs roughly 9.57% of what it costs to use GPT-4. This is a massive enabler for developers who need scalable evaluation but cannot afford GPT-4 prices.
Human Evaluation
Does “Consistent” mean “Correct”? Just because an LLM is consistent doesn’t mean it’s a good judge; it could be consistently wrong.
To verify accuracy, the authors recruited human experts to evaluate the answers. They compared how often the human verdict matched the LLM verdict.

Table 3 reveals a stunning result.
- The Human Agreement Rate (HAR) for standalone GPT-4 was 60.00%.
- The HAR for GPT-3.5 enhanced by PORTIA was 63.75%.
This implies that a weaker model, when properly aligned with PORTIA, can actually align better with human judgments than a stronger, unaligned model.
Why it Works: Ablation Study
Is the complex Semantic Alignment actually necessary? Could we just stick to the simpler Length Alignment?
The researchers performed an ablation study, removing components of the system to see how performance dropped.

Figure 3 shows the “Fixed Coverage” (how many errors were fixed) for different configurations.
- Blue Bars (Score-based): Removing semantic alignment (No SA) or length alignment (No LA) reduced performance, but length alignment seemed to do a lot of the heavy lifting.
- Yellow Bars (Likert-based): This is where Semantic Alignment shone. The Likert scale is nuanced, and aligning by length alone wasn’t enough; the full PORTIA system consistently outperformed both stripped-down variants.
The conclusion is that while Length Alignment is fast and effective for simple cases, Semantic Alignment is crucial for complex comparison forms like Likert scales and for ensuring maximum consistency.
Conclusion and Implications
The paper “Split and Merge” introduces a practical, algorithmically sound method to solve one of the most annoying problems in LLM evaluation: position bias.
By treating LLMs more like human readers—who need to break down complex arguments into digestible, comparable chunks—PORTIA achieves three major milestones:
- Consistency: It drastically reduces the “flip-flopping” of LLM judges.
- Accessibility: It allows open-source (Llama-2) and cheaper closed-source (GPT-3.5) models to perform at a level previously reserved for GPT-4.
- Accuracy: It improves alignment with human experts.
For students and practitioners in NLP, this highlights an important lesson: Better prompting and input structuring can sometimes yield gains equivalent to using a model 10x the size. Instead of just waiting for “GPT-5,” we can build smarter systems around the models we have today.
The “Split and Merge” technique is a testament to the power of alignment-based systems. As we move toward more autonomous AI agents, ensuring they are fair, consistent, and unbiased evaluators will be the foundation of trust in AI systems.