Why Legal AI Needs a Plan: Introducing LexKeyPlan
Artificial Intelligence is reshaping the legal landscape. From drafting contracts to summarizing case briefs, Large Language Models (LLMs) are passing bar exams and performing statutory reasoning at impressive levels. However, if you are a law student or a legal professional, you know that “impressive” isn’t good enough when the stakes are high. In law, precision is everything.
The Achilles’ heel of modern LLMs is hallucination. A model might write a beautifully persuasive argument but cite a case that doesn’t exist or apply a legal doctrine that was overruled ten years ago.
To fix this, researchers typically use Retrieval-Augmented Generation (RAG). This technique connects the AI to a database of real documents, allowing it to “look up” facts before writing. But there is a subtle, critical flaw in how standard RAG works for long-form writing: it only looks backward. It uses what has already been written to find documents, which often fails to capture what the model needs to write next.
In this post, we will dive deep into a fascinating new paper titled “LexKeyPlan: Planning with Keyphrases and Retrieval Augmentation for Legal Text Generation.” We will explore how this new framework teaches AI to “plan ahead” using keyphrases, ensuring that legal arguments are not just coherent, but factually grounded in the correct case law.
The Problem: The “Rearview Mirror” Effect
To understand why LexKeyPlan is necessary, we first need to understand the limitations of current methods.
The Limits of Parametric Knowledge
Standard LLMs (like GPT-4 or Llama) rely on “parametric knowledge”—information stored in their weights during training. While they memorize a great deal, they cannot capture every nuance of case law, and they cannot access rulings issued after training. When they don’t know an answer, they often fabricate one to satisfy the user’s prompt. In the legal domain, this is dangerous.
The Shortcomings of Standard RAG
RAG attempts to solve this by retrieving external documents. In a typical RAG setup for generating text (see the sketch after this list):
- The model looks at the Input Context (what has been written so far).
- It uses that context as a search query to find relevant documents.
- It generates the next sentence based on the context and the retrieved documents.
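To make this concrete, here is a minimal, runnable sketch of one context-as-query RAG step. The tiny corpus, the word-overlap retriever, and the `llm_generate` stand-in are toy illustrations, not the paper’s implementation.

```python
# Toy sketch of standard "rearview mirror" RAG: the retrieval query is
# simply the text written so far.

CORPUS = [
    "Handyside v. UK: freedom of expression, protection of public morals",
    "Garaudy v. France: Article 17, abuse of rights, hate speech exclusion",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy lexical retriever: rank documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def llm_generate(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g., a fine-tuned generator).
    return "<next passage>"

def standard_rag_step(context: str) -> str:
    docs = retrieve(context)  # the query is what has already been written
    prompt = "\n".join(docs) + "\n\n" + context + "\n\nContinue:"
    return llm_generate(prompt)  # conditioned on past context + documents
```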
This works well for answering simple questions. However, imagine an AI drafting a complex judgment for the European Court of Human Rights (ECHR).
The model might be reading a section on “The Facts” of a case involving a protest. Based on the facts alone, the RAG system might retrieve general cases about protests. But what if the legal reasoning (the next section to be written) needs to hinge on a specific, nuanced doctrine like “hate speech exclusion under Article 17”? The facts might not contain the specific keywords needed to find those distinct legal precedents.
Because standard RAG relies on the past (context) to retrieve information, it often misses the mark on what is needed for the future (the intended response). It is like trying to drive a car while looking only in the rearview mirror.
The Solution: LexKeyPlan
The researchers from the Technical University of Munich propose LexKeyPlan, a novel framework that introduces an anticipatory planning stage.
Instead of jumping straight from context to retrieval, LexKeyPlan asks the model to pause and ask: “What are the key legal concepts I need to discuss next?”
The Three-Step Framework
LexKeyPlan breaks the generation process into three distinct steps:
Content Planning (The Blueprint): The model analyzes the input context (e.g., the facts of the case) and generates a list of keyphrases. These keyphrases represent the legal concepts or specific terms that define the future content of the response. This is the “forward-looking plan.”
Retrieval (The Search): Instead of using the long, noisy context as a search query, the system uses the generated keyphrases. These keyphrases act as a precise search query to fetch relevant documents (like prior court judgments) from an external database.
Generation (The Execution): Finally, the model generates the actual text (the response). It is conditioned on three things:
- The original input context.
- The generated content plan (keyphrases).
- The retrieved documents.
By explicitly planning what to talk about before deciding where to look for information, the model aligns its retrieval mechanism with its intended legal reasoning.
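Reusing the toy `retrieve` and `llm_generate` stand-ins from the earlier sketch, the LexKeyPlan flow might look like this; the hard-coded keyphrases simulate what the fine-tuned plan generator would predict:

```python
def plan_keyphrases(context: str) -> list[str]:
    # Stand-in for the fine-tuned plan generator: predicts the legal
    # concepts the *next* section should discuss, given the facts so far.
    return ["article 17", "abuse of rights", "incitement to violence"]

def lexkeyplan_step(context: str) -> str:
    keyphrases = plan_keyphrases(context)        # 1. content planning
    docs = retrieve(" ".join(keyphrases))        # 2. plan-driven retrieval
    prompt = (
        "\n".join(docs)
        + "\n\nPlan: " + ", ".join(keyphrases)
        + "\n\n" + context + "\n\nContinue:"
    )
    return llm_generate(prompt)                  # 3. conditioned generation
```

The only difference from the standard RAG step is what drives the search: the forward-looking plan, rather than the backward-looking context.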
Why Keyphrases?
You might wonder: why keyphrases rather than, say, a full sentence as a plan? Keyphrases serve as a high-level abstraction. They are dense with information but low in noise. In the legal domain, phrases like “margin of appreciation,” “legitimate aim,” or “necessary in a democratic society” carry immense weight. They act as distinct anchors that guide the retrieval system to the exact cluster of relevant case law.
Training the Framework
The researchers needed to teach the model how to generate these plans. Since standard datasets don’t come with “future plans” attached, they had to be creative.
1. The Plan Generator
They took the target text (the actual judgment written by human judges) and extracted keyphrases from it using two algorithms:
- TextRank: A graph-based algorithm (similar to Google’s PageRank) that finds important words based on how they connect to other words.
- KeyBERT: A method that uses embeddings (vector representations of words) to find keywords that are semantically similar to the document.
These extracted keywords served as the “ground truth” to train the model. Essentially, they taught the model: “When you see these facts, you should predict these keywords.”
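For concreteness, here is how the KeyBERT side of this supervision could be produced with the real keybert package; the n-gram range and top_n below are illustrative choices, not necessarily the paper’s settings.

```python
# pip install keybert
from keybert import KeyBERT

target_text = (
    "The Court considers that the interference pursued the legitimate aim "
    "of protecting the reputation or rights of others ..."
)

kw_model = KeyBERT()  # defaults to a sentence-transformers backbone
keyphrases = kw_model.extract_keywords(
    target_text,
    keyphrase_ngram_range=(1, 3),  # allow multi-word legal phrases
    stop_words="english",
    top_n=10,
)
# keyphrases is a list of (phrase, score) pairs, e.g. ("legitimate aim", 0.71).
# The phrases, paired with the case facts, become the plan-generation targets.
```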
2. The Retriever
They experimented with two types of retrieval systems (see the sketch after this list):
- BM25: A standard lexical search (like a basic search engine) that matches exact words. This works surprisingly well in law because legal terms are precise.
- GTR (Generalizable T5-based Retriever): A dense retriever that looks for semantic meaning rather than exact word matches.
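Here is a hedged sketch of keyphrase-as-query retrieval with BM25, using the rank_bm25 package; the two-document corpus and the plan are toy illustrations.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Handyside v UK freedom of expression protection of morals",
    "Garaudy v France article 17 abuse of rights holocaust denial",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

plan = ["article 17", "abuse of rights"]  # keyphrases from the planning step
query = " ".join(plan).split()
print(bm25.get_top_n(query, corpus, n=1))
# -> the Garaudy case, matched on the precise legal terms in the plan
```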
Experiments & Results
To validate LexKeyPlan, the authors used the ECHR CaseLaw dataset. This dataset contains thousands of cases from the European Court of Human Rights.
The task was to generate “The Law” section (the legal reasoning) based on “The Facts” section of a case. This is a difficult task requiring the model to bridge the gap between raw events and legal conclusions.
Quantitative Analysis
The researchers compared LexKeyPlan against several baselines, including standard fine-tuning (no retrieval) and standard RAG (retrieval based on context). They used metrics like ROUGE (text overlap), BERTScore (semantic similarity), and AlignScore (factual consistency).
Let’s look at the results in Table 1.

Key Takeaways from Table 1:
- Planning Alone Helps (Rows b & c): Even without retrieving documents, simply generating a plan (keyphrases) improved the Coherence and Fluency of the text compared to the baseline (Row a). This suggests that breaking the task into “think then speak” reduces the cognitive load on the model.
- Retrieval Adds Accuracy (Rows d & e): Standard RAG improved the factual alignment scores, but it often hurt coherence. The model struggled to weave the retrieved text into a smooth narrative.
- LexKeyPlan Wins (Rows f - i): The full framework—planning plus retrieval—achieved the best results across the board. Specifically, using KeyBERT for planning and GTR for retrieval (Row i) yielded the highest alignment and coherence scores.
The results indicate that when the model anticipates future content via keyphrases, it retrieves more relevant documents, which in turn leads to more accurate and readable legal arguments.
Zero-Shot Performance
The researchers also tested whether this “planning” logic works without specific training (Zero-Shot). They prompted a standard Mistral-7B model to generate plans and use them.

As shown in Table 2, the trend holds. While zero-shot models struggle with complex instructions, the addition of planning (rows d, e, f) generally helps structure the output better than raw generation, although the gains are less dramatic than in the fine-tuned version. This highlights that for specialized tasks like law, fine-tuning on the specific planning workflow is highly beneficial.
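For intuition, a zero-shot “plan first” prompt might look like the sketch below; the wording is our illustration, not the paper’s exact template, and `llm_generate` is the same hypothetical stand-in used earlier (here imagined as a Mistral-7B endpoint).

```python
facts = "The applicant was convicted for posts made on social media ..."

plan_prompt = (
    "You are drafting the legal reasoning of an ECHR judgment.\n"
    f"Facts:\n{facts}\n\n"
    "First, list the keyphrases (Convention articles, doctrines, legal "
    "tests) that the reasoning section should cover, one per line."
)
plan = llm_generate(plan_prompt)  # step 1: elicit the plan zero-shot
# The plan is then pasted into a second prompt, together with the
# retrieved documents, to generate the reasoning itself.
```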
Quality of the Plan
Does the model actually generate good plans? Or is it just guessing random legal words? The researchers measured how well the generated keyphrases matched the actual topics in the target text.

Table 3 reveals a crucial insight: fine-tuning matters. The fine-tuned models (especially those trained with KeyBERT supervision) generated keyphrases with high semantic similarity (0.78) to the actual future content. This confirms that the model successfully learned to “predict” the legal direction of the case from the facts.
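One simple way to score a plan, sketched with the real sentence-transformers library, is to embed the generated keyphrases and the target section and compare them. The model choice and toy texts here are our assumptions, not necessarily the paper’s exact protocol.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

generated_plan = "article 17, abuse of rights, incitement to violence"
target_section = (
    "The Court recalls that under Article 17 speech aimed at the "
    "destruction of Convention rights falls outside Article 10 ..."
)

plan_emb, target_emb = model.encode([generated_plan, target_section])
score = util.cos_sim(plan_emb, target_emb).item()
print(f"plan/target semantic similarity: {score:.2f}")
```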
A Real-World Case Study
To truly appreciate LexKeyPlan, let’s look at a qualitative example described in the paper involving Freedom of Expression (Article 10).
The Scenario: An individual is convicted for inflammatory speech on social media and claims their right to freedom of expression was violated.
The Standard RAG Failure: The standard model sees the context “social media” and “speech.” It retrieves general cases about freedom of expression (like Handyside v. UK). It generates a judgment saying, “We must balance the user’s rights against public morals.”
- The Error: This misses a critical legal threshold. If speech incites violence, it might be excluded entirely under Article 17 (Abuse of Rights), meaning Article 10 doesn’t even apply.
The LexKeyPlan Success: The LexKeyPlan model first generates keyphrases like “Article 17,” “incitement to violence,” and “abuse of rights.”
- The Retrieval: These specific keys allow the retriever to ignore generic speech cases and instead find Garaudy v. France (a case about hate speech exclusion).
- The Result: The model writes: “Before applying Article 10, we must determine if this constitutes hate speech under Article 17. As established in Garaudy v. France, speech denying historical atrocities is not protected.”
This distinction—knowing which legal test to apply—is the difference between a passing and failing grade in law school, and the difference between justice and error in the real world.
Conclusion and Implications
LexKeyPlan demonstrates a powerful concept for the future of AI: Anticipatory Reasoning.
By decoupling the “what to say” (planning) from the “how to say it” (generation), and using the plan to drive the information search, we can build AI systems that are far more reliable.
Why This Matters for Students
If you are studying NLP or Law, this paper highlights the shift from “black box” generation to structured generation. We are moving away from simply asking a model to “write X” and toward systems that “plan X, research X, and then write X.”
Limitations to Keep in Mind
The authors candidly note that while LexKeyPlan is an improvement, it isn’t perfect.
- Temporal Relevance: The system uses BM25/GTR, which are general-purpose retrievers. They might retrieve a highly relevant case that was overruled last year. Legal-specific retrieval needs to account for the authority and timeline of cases.
- Domain Adaptation: The keyphrase extractors (KeyBERT) are general-purpose. Developing extractors specifically trained on legal taxonomies could boost performance even further.
LexKeyPlan is a significant step toward “Copilot” systems that lawyers can actually trust—systems that don’t just guess the law, but research it with the foresight of a legal professional.