If you have spent any time working with Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the dark art of Prompt Engineering. We spend hours tweaking phrases, adding “Let’s think step by step,” or restructuring paragraphs just to get the model to output the correct answer. It is a process that feels less like engineering and more like casting spells—change one word, and the magic works; change another, and it fails.

But what if the words aren’t the only thing that matters? What if the empty space between the words is just as important?

In a fascinating new paper titled Position Engineering: Boosting Large Language Models through Positional Information Manipulation, researchers from Microsoft Research introduce a novel technique that might make traditional prompt engineering look antiquated. They demonstrate that by manipulating the “positional indices” of tokens—essentially lying to the model about where words are located—they can significantly boost performance in complex tasks like Retrieval-Augmented Generation (RAG).

In this post, we will deconstruct this paper, explain the math behind “Position Engineering,” and show you how inserting invisible “ghost tokens” can lead to double-digit accuracy gains.

The Coordinates of Language

To understand Position Engineering, we first need to step back and look at how an LLM reads.

When you feed a sentence into an LLM, it doesn’t just see a bag of words. It sees a sequence. The model needs to know that in the sentence “Man bites dog,” “Man” comes before “bites.” To handle this, the Transformer architecture (the backbone of modern LLMs) separates the input into two streams of data:

  1. Semantic Information: What the word means (the token embedding).
  2. Positional Information: Where the word is sitting in the sequence (the positional embedding).

The model’s attention mechanism combines these two pieces of information to decide which words should “attend” to (focus on) which other words.
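
To make the two streams concrete, here is a tiny sketch (mine, not from the paper) of what a Transformer actually receives: a content embedding per token and, separately, an integer position index. The vocabulary, dimensions, and sentence are made up for illustration.

```python
# Minimal sketch of the two input streams a Transformer sees.
# Vocabulary, embedding size, and sentence are illustrative only.
import numpy as np

tokens = ["Man", "bites", "dog"]
vocab = {"Man": 0, "bites": 1, "dog": 2}

d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Stream 1: semantic information (what each token means).
token_embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])

# Stream 2: positional information (where each token sits).
position_ids = np.arange(len(tokens))   # [0, 1, 2]

print(token_embeddings.shape)  # (3, 8)
print(position_ids)            # [0 1 2]
```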

The Standard Approach: Prompt Engineering

Traditionally, if we want to improve model performance, we modify the text. We add new instructions, rephrase sentences, or provide examples. This changes both the semantic information and the positional information simultaneously because adding a word pushes everything else down the line.

Figure 1: Comparison of prompt engineering and position engineering.

As shown in Figure 1 above, Prompt Engineering (top) physically changes the text. You add Sentence 3, and the paragraph changes.

Position Engineering (bottom) is different. The researchers propose inserting Placeholder Tokens (PH Tokens). These are “ghost” tokens. They do not contain any text, and they do not participate in the calculation of meaning. However, they do occupy an index number.

By inserting these placeholders, the model thinks that “Sentence 2” is very far away from “Sentence 1,” even though no actual text separates them.

The Core Method: How It Works

This technique relies on hacking the Attention Mechanism. Let’s look at the math to see why this works.

In a standard attention layer, the model computes three vectors for every token: a Query (\(q\)), a Key (\(k\)), and a Value (\(v\)).

\[
q_m = f_q(e_m, m), \qquad k_n = f_k(e_n, n), \qquad v_n = f_v(e_n, n)
\]

Here, \(e_m\) is the embedding of the token, and \(m\) is its position index. The attention score (\(a_{m,n}\)) determines how much the token at position \(m\) cares about the token at position \(n\).

\[
a_{m,n} = \frac{\exp\!\left(q_m^{\top} k_n / \sqrt{d}\right)}{\sum_{j} \exp\!\left(q_m^{\top} k_j / \sqrt{d}\right)}
\]

The crucial part is that the attention score depends on the dot product of the Query (\(q\)) and the Key (\(k\)).

The Role of Rotary Position Embeddings (RoPE)

Most modern models, like Llama 2 and Mistral, use a scheme called Rotary Position Embedding (RoPE). Instead of adding a separate positional vector to the token embedding, RoPE rotates the query and key vectors in geometric space by an angle determined by the position index.

\[
f_q(e_m, m) = \mathbf{R}_m^d \, W_q \, e_m, \qquad f_k(e_n, n) = \mathbf{R}_n^d \, W_k \, e_n
\]

In the equation above, \(\mathbf{R}_m^d\) is a rotation matrix determined by the position \(m\). This means the angle of the vector depends on where the word sits in the sentence. When the model compares two words, it looks at the relative distance between them (\(n - m\)).

\[
q_m^{\top} k_n = \left(\mathbf{R}_m^d W_q e_m\right)^{\top} \mathbf{R}_n^d W_k e_n = e_m^{\top} W_q^{\top} \mathbf{R}_{n-m}^d W_k e_n
\]

This is the mechanism Position Engineering exploits. By changing the indices \(m\) and \(n\), the researchers effectively change the rotation of the vectors. This alters the resulting dot product, thereby manually adjusting the attention weights between different parts of the prompt.
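
To see the mechanism in numbers, here is a small NumPy sketch (my own illustration, not the authors' code) of a RoPE-style rotation. It checks two things: the query–key dot product depends only on the relative distance between position indices, and inserting a "gap" into one index therefore changes the attention score.

```python
# Toy RoPE in NumPy: rotate 2D sub-pairs of a vector by angles that
# scale with the position index, then compare dot products.
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to vector x at position index `pos`."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)    # one frequency per 2D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative distance (10) at different absolute positions -> same score.
s1 = rope_rotate(q, 5)   @ rope_rotate(k, 15)
s2 = rope_rotate(q, 105) @ rope_rotate(k, 115)
print(np.isclose(s1, s2))   # True: only n - m matters

# Insert a "gap" of 100 positions between q and k -> the raw score changes.
s3 = rope_rotate(q, 5) @ rope_rotate(k, 115)
print(s1, s3)               # different values: the attention weight is altered
```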

The “Ghost” Function

The researchers define a position editing function, \(\tau\), which re-maps the indices.

\[
\tau : m \mapsto \tau(m), \qquad q_m^{\top} k_n \;\longrightarrow\; e_m^{\top} W_q^{\top} \mathbf{R}_{\tau(n)-\tau(m)}^{d} W_k e_n
\]

A placeholder token (a “ghost token”) never enters the attention computation itself; its only effect is to shift the position indices of every real token that follows it.

For example, if you have an Instruction followed by a Document, usually the Document starts immediately after the Instruction (e.g., index 10). If you perform Position Engineering by adding 100 placeholder tokens, the Instruction ends at index 9, but the Document now starts at index 110.

To the model (specifically via RoPE), the Document is now “far away” from the Instruction. As we will see, this distance allows the model to process the segments more independently, reducing confusion.
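
Here is a minimal sketch of such a re-mapping in code, using the numbers from the example above (a 10-token instruction followed by a 100-index gap). The helper `edit_positions` is my own illustration, not the authors' implementation.

```python
# Sketch of a position-editing function tau: real tokens keep their order,
# but a block of "ghost" indices is skipped between chosen segments.
from typing import List

def edit_positions(segment_lengths: List[int], gaps: List[int]) -> List[int]:
    """Return one position index per real token, skipping `gaps[i]` unused
    indices after segment i. No tokens are added; only indices are skipped."""
    padded_gaps = gaps + [0] * (len(segment_lengths) - len(gaps))
    positions, next_index = [], 0
    for length, gap in zip(segment_lengths, padded_gaps):
        positions.extend(range(next_index, next_index + length))
        next_index += length + gap   # jump over the ghost-token indices
    return positions

# Instruction of 10 tokens, then a 100-index gap, then a 5-token document.
print(edit_positions([10, 5], [100]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 110, 111, 112, 113, 114]
```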

Experiment 1: Retrieval-Augmented Generation (RAG)

The authors first tested this on Retrieval-Augmented Generation (RAG). In RAG, you typically have a prompt structure that looks like this:

  1. Instruction: “Answer the question based on the documents…”
  2. Documents: The retrieved text snippets.
  3. Question: The user’s actual query.

The challenge in RAG is that models often get “lost in the middle” or over-prioritize the instructions while ignoring the nuances of the documents.

The Setup

The researchers introduced two variables, \(\theta_A\) and \(\theta_B\), representing the number of placeholder tokens (ghost tokens) inserted into the prompt.

Diagram of Position Engineering for RAG, showing \(\theta_A\) and \(\theta_B\).

  • \(\theta_A\): The gap between the Instruction and the Documents.
  • \(\theta_B\): The gap between the Documents and the Question.

They tested values ranging from 0 to 2500 tokens using the Llama2-13B-chat model.
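
As a rough sketch of how this could be reproduced in practice, the snippet below builds the shifted position indices for an instruction/documents/question prompt and feeds them through the `position_ids` argument of a Hugging Face forward pass. The checkpoint name, the toy prompt text, and the use of the \(\theta_A = 1900, \theta_B = 400\) values discussed later are my assumptions, not the authors' released code.

```python
# Sketch: building position_ids with gaps theta_A / theta_B for a RAG prompt.
# Assumes a RoPE-based Hugging Face model whose forward() accepts `position_ids`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"   # assumed checkpoint (gated access)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # needs `accelerate`
)

instruction = "Answer the question based on the given documents.\n"  # toy prompt
documents = "Document [1]: ...retrieved snippet goes here...\n"
question = "Question: Who proposed position engineering?\nAnswer:"

theta_a, theta_b = 1900, 400   # the "universal" gaps discussed later in the post

ids, pos = [tok.bos_token_id], [0]          # start with BOS at position 0
offset = 1
for text, gap_after in [(instruction, theta_a), (documents, theta_b), (question, 0)]:
    seg = tok(text, add_special_tokens=False).input_ids
    ids += seg
    pos += list(range(offset, offset + len(seg)))
    offset += len(seg) + gap_after          # skip the "ghost" indices after this segment

input_ids = torch.tensor([ids], device=model.device)
position_ids = torch.tensor([pos], device=model.device)

with torch.no_grad():
    out = model(input_ids=input_ids, position_ids=position_ids)
# out.logits[:, -1] scores the next token under the edited positions.
# Full generation would need the same offsets applied at every decode step.
```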

The Results

The results were surprisingly strong. By simply adjusting these invisible gaps, the accuracy of the model improved significantly across multiple datasets (NQ Open, TriviaQA, etc.).

Table 1: Results for RAG tasks.

As Table 1 shows, the improvements are substantial. For the “WebQuestions” dataset with 1 document, the accuracy jumped by 15.4%.

Notice the pattern in the optimal values (\(\theta_A^*\) and \(\theta_B^*\)):

  • \(\theta_A\) (Instruction-to-Document gap) tends to be large (1000–2000).
  • \(\theta_B\) (Document-to-Question gap) tends to be smaller (300–500).

This suggests that the model performs best when the Instruction is pushed far away from the Documents, essentially “decoupling” the general rules from the specific content.

Finding the Universal “Sweet Spot”

Is there a magic number that works for everything? The researchers aggregated their results to create a heatmap of performance based on the values of \(\theta_A\) and \(\theta_B\).

Figure 3: Heatmap of average percentile values for positional configurations.

Figure 3 reveals a clear “safe zone.” A high \(\theta_A\) (around 1900) and a moderate \(\theta_B\) (around 400) consistently yield high performance. The researchers found that this specific configuration (\(\theta_A = 1900, \theta_B = 400\)) worked universally well across different datasets.

Table showing universal position configuration results.

Why not just delete the instruction?

You might wonder: if pushing the instruction 2000 tokens away helps, is it because the instruction is bad? Should we just delete it?

The authors tested this. They ran the prompts without the instruction segment entirely.

Table 3: Comparison of the baseline with removing the instruction entirely.

As Table 3 shows, removing the instruction entirely yields results similar to the baseline (no position engineering). This indicates that the instruction is necessary, but Position Engineering allows the model to balance the influence of the instruction against the content of the documents more effectively. It’s about finding the right “attention distance,” not removing the content.

Experiment 2: In-Context Learning (ICL)

The second scenario tested was In-Context Learning (Few-Shot Prompting). This is where you give the model a few examples of a task (e.g., “Review: Good movie -> Positive”) before asking it to solve a new one.

The prompt structure involves:

  1. Instruction
  2. Examples (Demonstrations)
  3. Query

The researchers introduced gaps between the instruction and examples (\(\theta_A\)), between the examples and the query (\(\theta_B\)), and even between the examples themselves (\(\theta_{mid}\)).

Diagram of Position Engineering for ICL.
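
As before, here is a rough sketch (my own) of the corresponding position indices. The segment lengths are made-up token counts; the only new ingredient relative to the RAG example is the repeated \(\theta_{mid}\) gap between consecutive demonstrations.

```python
# Sketch: position indices for few-shot ICL with gaps theta_A, theta_mid, theta_B.
# Segment lengths are illustrative token counts, not real tokenizations.
def icl_positions(instr_len, demo_lens, query_len, theta_a, theta_mid, theta_b):
    segments = (
        [(instr_len, theta_a)]
        + [(d, theta_mid) for d in demo_lens[:-1]]   # gap after each demo but the last
        + [(demo_lens[-1], theta_b), (query_len, 0)]
    )
    positions, offset = [], 0
    for length, gap in segments:
        positions += list(range(offset, offset + length))
        offset += length + gap
    return positions

# e.g. the TREC-style setting from the paper: theta_mid = 40, no other gaps.
print(icl_positions(instr_len=12, demo_lens=[8, 8, 8], query_len=6,
                    theta_a=0, theta_mid=40, theta_b=0))
```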

The Results

Once again, Position Engineering improved performance, though the optimal settings differed from those for RAG.

For the TREC dataset (question classification), the model preferred a small gap between examples (\(\theta_{mid} = 40\)) and no gaps elsewhere. For SST2 (sentiment analysis), it preferred a small gap before the query (\(\theta_B = 100\)).

While the gains were smaller than in RAG (1.9% to 3.6%), they are further evidence that “where” tokens sit matters just as much as “what” they say.

Discussion: The Physics of Attention

Why does inserting 1900 “ghost tokens” make a model smarter?

The authors hypothesize that Position Engineering works by finely adjusting attention weights. In models using RoPE, attention scores usually decay as the relative distance between tokens increases. This is a property known as “long-term decay.”

By artificially increasing the distance between the Instruction and the Documents (using a large \(\theta_A\)), we are essentially lowering the attention volume of the instruction. We aren’t muting it entirely (which leads to poor performance), but we are turning it down enough so that the model can focus more intensely on the Documents.

Similarly, placing the Question moderately close to the Documents (a smaller \(\theta_B\)) ensures the model relies heavily on the retrieved text to form its answer.
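
A quick toy calculation (mine, not the paper's) illustrates this decay: if the query and key content is held fixed (all-ones vectors here), the RoPE dot product collapses to a sum of cosines of the relative distance times each rotary frequency, and that sum shrinks, on average, as the distance grows.

```python
# Toy illustration of RoPE "long-term decay": with fixed (all-ones) query/key
# content, the raw dot product equals 2 * sum_i cos(distance * freq_i),
# which shrinks on average as the relative distance grows.
import numpy as np

d_model = 128
half = d_model // 2
freqs = 10000.0 ** (-np.arange(half) / half)   # standard RoPE frequencies

for distance in [0, 10, 100, 400, 1900]:
    score = 2.0 * np.sum(np.cos(distance * freqs))
    print(f"relative distance {distance:>5}: raw score {score:8.2f}")
```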

Advantages of Position Engineering

  1. Computational Efficiency: Unlike adding real words or retrieving more documents, Position Engineering does not increase the computational load. The “ghost tokens” are skipped during calculation; only the indices of the real tokens are changed.
  2. Orthogonal to Prompt Engineering: You don’t have to choose between rewriting your prompt and using Position Engineering. You can do both.
  3. Simpler Optimization: Searching over a couple of integers (the gap sizes) is mathematically far easier than searching over the space of possible English sentences (see the sketch after this list).
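
To illustrate how small that search space is, here is a hypothetical grid-search sketch; `evaluate_rag` stands in for whatever dev-set evaluation you already have and is not a real API.

```python
# Sketch of the "simpler optimization": grid-search two integers on a dev set.
# `evaluate_rag` is a hypothetical callable that builds prompts with the given
# gaps (as in the earlier position_ids example) and returns dev-set accuracy.
import itertools

def tune_gaps(evaluate_rag,
              a_grid=(0, 500, 1000, 1500, 1900, 2500),
              b_grid=(0, 100, 300, 400, 500)):
    best = max(itertools.product(a_grid, b_grid),
               key=lambda ab: evaluate_rag(theta_a=ab[0], theta_b=ab[1]))
    return best   # (theta_a*, theta_b*) for this task

# Usage: best_a, best_b = tune_gaps(my_dev_set_evaluator)
```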

Conclusion

The paper Position Engineering opens a new frontier in how we interact with Large Language Models. It moves us away from the subjective art of “wordsmithing” toward a more quantitative science of “attention management.”

The key takeaways for students and practitioners are:

  • Indices Matter: A token’s position is a distinct input feature that can be manipulated independently of the text.
  • Space Creates Focus: Decoupling different parts of a prompt (Instruction vs. Data) by inserting index gaps can reduce interference and improve accuracy.
  • Universal Settings: For RAG tasks on Llama-2 based models, try inserting a large index gap (~1900) after your instructions.

As models continue to evolve with larger context windows, techniques like Position Engineering will likely become standard tools in the developer’s toolkit—allowing us to guide these massive neural networks not just by what we say, but by where we place our words in the vast mathematical space of attention.