Unlocking Judicial Fairness: How LLMs and Legal Knowledge are Revolutionizing Case Retrieval

In the legal world, the concept of stare decisis—standing by things decided—is foundational. For judges and lawyers, sourcing relevant precedents is not just a research task; it is a critical requirement for upholding judicial fairness. If a judge cannot find a past case that mirrors the current one, the consistency of the law is at risk.

However, finding that “needle in a haystack” is becoming increasingly difficult. Legal case retrieval is significantly different from typing a question into Google. The “queries” are often entire case documents describing a new situation, and the “documents” to be retrieved are lengthy, complex judgments from the past. These documents are filled with specialized jargon, intricate procedural details, and often contain multiple different crimes within a single text.

Traditional information retrieval methods and even modern standard language models struggle here. They either get overwhelmed by the text length (truncating the important bits) or fail to understand the specific legal nuance required to match a query to a precedent.

In this post, we are doing a deep dive into KELLER (Knowledge-guidEd case reformuLation for LEgal case Retrieval), a new approach presented by researchers from Renmin University of China. This method creatively combines Large Language Models (LLMs) with professional legal knowledge to “reformulate” messy case documents into structured, meaningful facts. The result is a retrieval system that is not only more accurate but also—crucially—interpretable for legal professionals.

To understand why KELLER is necessary, we first need to look at the data. In standard web search, the query might be “best pizza in New York” (short) and the document a blog post (medium-length). In legal retrieval, both the query and the target are full legal cases.

As shown in the image below, a legal case is a structured document containing several distinct sections: Procedure, Fact, Reasoning, Decision, and Tail.

Examples of a query case and a candidate case document. The query case typically contains only partial content since it has not yet been adjudicated. Extractable crimes and law articles are highlighted in red.

The “Fact” section (highlighted in red) is usually the most important for finding similar cases. However, notice the complexity. A single case might involve arson, but it could also involve inheritance disputes or assault.

The challenges are threefold:

  1. Length: These documents often exceed the input limits of popular retrieval models like BERT (which usually cap at 512 tokens).
  2. Multi-fact Complexity: A defendant might be charged with multiple crimes (e.g., drug trafficking and illegal possession of firearms). A relevant precedent might match only one of those crimes. Standard models often average everything out, losing the specific signal needed for a match.
  3. Lack of Expert Knowledge: Generic summarization tools don’t know what is legally significant. They might summarize a 5,000-word document into a 100-word paragraph that captures the “story” but misses the specific legal elements (like which specific law article was violated) that actually determine relevance.

The Solution: KELLER

The researchers propose that we shouldn’t just feed raw text into a model. Instead, we should use the reasoning capabilities of LLMs, guided by explicit legal knowledge, to rewrite the case into a format that is easier for a retrieval model to digest.

The KELLER framework operates in three main stages:

  1. Knowledge-Guided Case Reformulation: Breaking the case down into “sub-facts.”
  2. Relevance Modeling: Scoring how well the sub-facts of a query match the sub-facts of a candidate document.
  3. Dual-Level Contrastive Learning: Training the model to understand connections at both the broad case level and the granular sub-fact level.

Here is the high-level overview of the architecture:

Overview of KELLER. We first perform legal knowledge-guided prompting to reformulate the legal cases into a series of crucial and concise sub-facts. Then, we directly model the case relevance based on the sub-facts. The model is trained at both the coarse-grained case level and the fine-grained sub-fact level via contrastive learning.

Let’s break down these distinct components to understand how they work together.

1. Knowledge-Guided Case Reformulation

This is the heart of the paper’s contribution. The goal is to transform a long, messy legal narrative into a set of clean, concise sub-facts.

If you ask a generic AI to “summarize this case,” it might give you a blur of events. KELLER instead takes a structured, two-step prompting approach.

Step 1: Extraction

First, the system prompts the LLM to act as a legal expert. It scans the full text to extract Crimes and Law Articles.

  • Input: The full case text.
  • Output: A list of specific crimes (e.g., “The crime of arson”) and the specific articles of the Criminal Law involved.

This is easier for an LLM than full summarization because these elements are usually distinct and identifiable. The researchers also use a database of legal knowledge to ensure the extracted crimes map correctly to the law articles.
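
The paper’s exact prompts aren’t reproduced here, but a minimal sketch of this extraction step might look like the following. The prompt wording, the `call_llm` helper, and the JSON output format are assumptions for illustration, not the authors’ implementation:

```python
import json

# Sketch of Step 1: prompt an LLM to extract crimes and law articles
# from the full case text. `call_llm` stands in for whatever LLM API you use.
EXTRACTION_PROMPT = """You are a legal expert. Read the following case text and list
(1) every crime charged and (2) the Criminal Law articles involved.
Return JSON of the form {{"crimes": [...], "law_articles": [...]}}.

Case text:
{case_text}
"""

def extract_crimes_and_articles(case_text: str, call_llm) -> dict:
    """Run the extraction prompt over the full case text (Step 1)."""
    response = call_llm(EXTRACTION_PROMPT.format(case_text=case_text))
    # e.g. {"crimes": ["The crime of arson"], "law_articles": ["Article 114"]}
    return json.loads(response)
```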

Step 2: Guided Summarization

Now that the system knows what crimes are in the document, it performs the summarization. But here is the trick: it doesn’t summarize the whole text at once. It performs summarization per crime.

The prompt basically asks: “Given that this case involves the Crime of Arson and violates Article 114, please summarize the specific facts in the text related to this crime.”

The result is a set of sub-facts. If a case involves three different crimes, the output is three distinct, concise text snippets, each describing the factual basis for one crime. This solves the “long text” problem (the snippets are short) and the “multi-fact” problem (each crime is handled separately).
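
Continuing the sketch above, the per-crime summarization step could be wired up roughly like this (again, the prompt wording and helper names are assumptions, not the paper’s exact prompts):

```python
# Sketch of Step 2: one guided summarization call per extracted crime.
SUMMARY_PROMPT = """You are a legal expert. The case below involves {crime},
violating {articles}. Summarize only the facts in the text that relate to this
crime, concisely, keeping the legally significant details.

Case text:
{case_text}
"""

def reformulate_case(case_text: str, call_llm) -> list[str]:
    """Produce one concise sub-fact per extracted crime (Step 2)."""
    extracted = extract_crimes_and_articles(case_text, call_llm)
    sub_facts = []
    for crime in extracted["crimes"]:
        # In KELLER a legal knowledge base maps each crime to its law articles;
        # for simplicity this sketch passes along all extracted articles.
        prompt = SUMMARY_PROMPT.format(
            crime=crime,
            articles=", ".join(extracted["law_articles"]),
            case_text=case_text,
        )
        sub_facts.append(call_llm(prompt).strip())
    return sub_facts  # one snippet per crime, e.g. three crimes -> three sub-facts
```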

2. Relevance Modeling

Once the query case and the candidate documents are reformulated into these clean sub-facts, how do we calculate if they are a match?

KELLER doesn’t rely on a single similarity score for the whole document. Instead, it builds a Similarity Matrix.

First, every sub-fact is encoded into a vector embedding using a text encoder (specifically, a pre-trained legal model called SAILER).

Equation for encoding query and document sub-facts

In this equation, \(E_{q_i}\) is the embedding for the \(i\)-th sub-fact of the query, and \(E_{d_j}\) is the embedding for the \(j\)-th sub-fact of the candidate document.

Next, the model calculates the similarity between every query sub-fact and every document sub-fact using the dot product.

Equation for calculating the similarity matrix

If the query has 3 sub-facts and the document has 4 sub-facts, this results in a \(3 \times 4\) matrix of scores.
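
Written out with the notation from the text (a sketch of the formulation; the paper’s exact equations may differ in details such as normalization), these two steps are:

\[
E_{q_i} = \mathrm{Encoder}(q_i), \qquad E_{d_j} = \mathrm{Encoder}(d_j), \qquad S_{ij} = E_{q_i}^{\top} E_{d_j},
\]

where \(i\) runs over the query’s sub-facts and \(j\) over the candidate document’s sub-facts.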

Aggregation: MaxSim and Sum

To get a final score for the document, KELLER uses a “MaxSim, then Sum” aggregation.

Imagine a query involves “Robbery.” The candidate document involves “Robbery” and “Tax Evasion.”

  • The model looks at the “Robbery” sub-fact in the query.
  • It compares it to both “Robbery” and “Tax Evasion” in the document.
  • It takes the Maximum score (MaxSim). Obviously, Robbery matches Robbery better than Tax Evasion.
  • It does this for every sub-fact in the query and Sums up the max scores.

Equation for aggregating scores using MaxSim and Sum

This approach is highly effective because it mimics how a lawyer thinks: “Does this precedent contain the specific legal element I am looking for? If yes, it’s relevant, regardless of what other irrelevant stuff is in there.”
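
As a concrete toy example of the aggregation, here is the MaxSim-and-Sum computation in NumPy over pre-computed sub-fact embeddings. The embeddings below are random placeholders; in KELLER they would come from the legal text encoder (SAILER):

```python
import numpy as np

rng = np.random.default_rng(0)
E_q = rng.normal(size=(3, 768))   # 3 query sub-fact embeddings
E_d = rng.normal(size=(4, 768))   # 4 document sub-fact embeddings

S = E_q @ E_d.T                   # similarity matrix of dot products, shape (3, 4)
score = S.max(axis=1).sum()       # MaxSim over document sub-facts, then Sum over query sub-facts
print(S.shape, float(score))      # (3, 4) and a single relevance score
```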

3. Dual-Level Contrastive Learning

To train this model, the researchers use Contrastive Learning. The basic idea is to pull positive (matching) pairs closer together in vector space and push negative (non-matching) pairs apart.

KELLER innovates by doing this at two levels.

Level 1: Case-Level Training

This uses the standard ground-truth labels provided in the dataset. If human experts say “Case A is relevant to Case B,” the model is trained to maximize their similarity score.

Equation for case-level ranking loss
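
The exact loss isn’t reproduced in this post, but a standard contrastive ranking loss of this kind, over a labeled relevant case \(d^{+}\) and a set of negative cases \(D^{-}\), has the form:

\[
L_R = -\log \frac{\exp\big(s(q, d^{+})\big)}{\exp\big(s(q, d^{+})\big) + \sum_{d^{-} \in D^{-}} \exp\big(s(q, d^{-})\big)},
\]

where \(s(q, d)\) is the MaxSim-and-Sum score defined earlier. Treat this as a sketch of the idea rather than the paper’s exact formulation.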

Level 2: Sub-Fact-Level Training

This is the tricky part. The datasets provide relevance labels for whole cases, but they don’t tell us which specific sub-facts match. To train the model’s granular understanding, the researchers devised a heuristic strategy for generating “silver” labels at the sub-fact level.

They use the extracted crime types to determine matches.

  • If a query sub-fact involves “Theft,” and a document sub-fact involves “Theft,” they are treated as a Positive Pair.
  • If a query sub-fact involves “Theft,” and a document sub-fact involves “Assault,” they are treated as a Negative Pair.

This logic is visualized below:

Illustration of our proposed sub-fact-level contrastive learning. The green and red squares represent the positive pairs and negative pairs, respectively.

This creates a rich set of training signals, allowing the model to learn fine-grained semantic matching even without expensive human annotation at the sub-fact level.
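
A minimal sketch of this silver-labeling heuristic, assuming each sub-fact keeps track of the crime it was generated from (the tuple format and function name are illustrative):

```python
def silver_subfact_pairs(query_subfacts, doc_subfacts):
    """Build positive/negative sub-fact pairs from extracted crime types.

    Both inputs are lists of (crime_type, sub_fact_text) tuples, one entry
    per sub-fact produced during reformulation.
    """
    positives, negatives = [], []
    for q_crime, q_text in query_subfacts:
        for d_crime, d_text in doc_subfacts:
            if q_crime == d_crime:
                positives.append((q_text, d_text))   # e.g. Theft vs. Theft
            else:
                negatives.append((q_text, d_text))   # e.g. Theft vs. Assault
    return positives, negatives
```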

The final loss function combines both the case-level signal (\(L_R\)) and the sub-fact signal (\(L_S\)).

Equation for total loss function
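
In its simplest form, this combination is just the sum below; the paper may additionally weight the two terms, so treat this as a sketch:

\[
L = L_R + L_S.
\]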

The sub-fact loss calculation itself follows a similar ranking logic:

Equation for sub-fact level ranking loss


Experimental Results

The researchers evaluated KELLER on two major Chinese legal case retrieval benchmarks: LeCaRD and LeCaRDv2. These datasets are annotated by legal experts, making them the gold standard for this task.

Main Performance

The results were impressive. KELLER outperformed all baselines, including traditional methods like BM25, general models like BERT, and even specialized long-text legal models like Lawformer.

Table 1: Main results of the fine-tuned setting on LeCaRD and LeCaRDv2.

In Table 1 above, you can see KELLER achieves the highest MAP (Mean Average Precision) and NDCG scores. The “dagger” symbol (\(\dagger\)) indicates that these improvements are statistically significant. This is strong evidence that breaking a case down into sub-facts beats trying to shove the whole long text into a model.

Robustness (Zero-Shot)

One of the biggest problems in legal AI is the lack of training data. A model that only works after seeing thousands of examples isn’t always practical. The researchers tested KELLER in a “Zero-shot” setting (without fine-tuning on the target training set).

Table 2: Zero-shot performance on LeCaRD and LeCaRDv2.

As shown in Table 2, KELLER still dominates. This suggests that the “reformulation” process using LLMs provides a strong enough signal that the model works well even without extensive specific training.

Handling “Controversial” Queries

Not all legal cases are created equal. Some are “Common” (straightforward), while others are “Controversial” (complex cases that might have been retried or have conflicting interpretations).

The researchers broke down performance by query type.

Figure 3: Evaluation on different query types. We evaluate four models on (a) LeCaRD and (b) LeCaRDv2.

Figure 3 highlights a key strength: KELLER (the purple bars) shows massive improvements over baselines specifically on Controversial queries. While other models degrade when the case gets complicated, KELLER’s ability to isolate specific sub-facts allows it to handle complexity much more gracefully.

Component Analysis

Is every part of KELLER necessary? The ablation study (removing one piece at a time) confirms that yes, the whole system is required.

Table 3: Results of ablation study on LeCaRDv2.

Table 3 shows that replacing the knowledge-guided reformulation with naive summarization (KGCR -> NS) causes a large drop in performance. Summarizing the text alone isn’t enough; the extracted legal knowledge (crimes and law articles) is needed to guide the summarization.


Interpretability: The “Why” Matters

In high-stakes domains like law (or medicine), a black-box AI is dangerous. A lawyer cannot go to a judge and say, “This case is relevant because the AI said score 0.98.” They need to point to specific facts.

Because KELLER matches sub-facts to sub-facts, it offers inherent interpretability. We can visualize exactly which part of the query triggered the match with the document.

Figure 4: An example of the interpretability of KELLER. Figure 5: Comparison of text versions.

Look at Figure 4 (top half of the image above). It shows a heatmap.

  • Query Sub-fact \(q_1\) (related to stealing) matches strongly with Document Sub-fact \(d_1\) (related to robbing).
  • Query Sub-fact \(q_2\) (related to fleeing) matches strongly with Document Sub-fact \(d_2\) (related to evading arrest).

This gives the user a clear rationale: “This case was retrieved because the theft details match, and the evasion details match.”
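
If you have the sub-fact similarity matrix from the relevance-modeling step, producing a heatmap like this takes only a few lines of matplotlib. The scores and labels below are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Toy sub-fact similarity matrix: rows are query sub-facts, columns are document sub-facts.
S = np.array([[0.92, 0.31],
              [0.28, 0.88]])
q_labels = ["q1: stealing", "q2: fleeing"]
d_labels = ["d1: robbing", "d2: evading arrest"]

fig, ax = plt.subplots()
im = ax.imshow(S, cmap="viridis")
ax.set_xticks(range(len(d_labels)))
ax.set_xticklabels(d_labels)
ax.set_yticks(range(len(q_labels)))
ax.set_yticklabels(q_labels)
fig.colorbar(im, ax=ax, label="sub-fact similarity")
plt.tight_layout()
plt.show()
```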

The bottom half of the image (Figure 5) compares the text quality. The “Naive Summarization” (black box) glosses over details. The “Knowledge-Guided Reformulation” (green box) clearly separates the text into “Transporting Drugs,” “Illegal Possession of Drugs,” and “Illegal Possession of Firearms.” This structure ensures no criminal behavior is overlooked.

Conclusion

The KELLER framework represents a significant step forward in Legal Information Retrieval. By acknowledging that legal cases are not just “long text” but “structured collections of facts,” the researchers were able to design a system that respects the domain’s complexity.

Key takeaways:

  1. Guidance Matters: LLMs are powerful, but they are better when guided by domain knowledge (like specific law articles) rather than just being asked to “summarize.”
  2. Granularity Wins: Breaking documents into sub-facts allows for matching specific legal elements, which is critical for complex cases with multiple crimes.
  3. Interpretability is Possible: We don’t have to sacrifice understanding for performance. By using explicit sub-fact matching, we get better results and a clearer explanation of why those results were chosen.

As AI continues to integrate into the judicial system, approaches like KELLER—which prioritize structure, legal logic, and interpretability—will be essential tools for ensuring that technology supports, rather than obscures, the pursuit of justice.