Introduction

The rapid rise of Large Language Models (LLMs) like ChatGPT and GitHub Copilot has fundamentally changed the landscape of software development. For professional developers, these tools are powerful productivity boosters. However, for computer science educators, they represent a looming crisis.

In introductory programming courses (often called CS1 and CS2), the primary goal is to teach students the foundational logic of coding—loops, conditionals, and data structures. The problem? LLMs are exceptionally good at these standard problems. A student can copy a prompt, paste it into ChatGPT, and receive a working solution in seconds, bypassing the learning process entirely.

While some educators argue for integrating these tools, the reality is that immediate, unchecked access can hinder the development of core problem-solving skills. Universities have limited options: they can try to detect AI-generated code (a notoriously unreliable approach that is prone to false positives), or they can try to “LLM-proof” their assignments.

This blog post explores a fascinating research paper titled “Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation.” The researchers propose a novel defense: instead of banning the AI, what if we change the assignment text in subtle ways that humans can read, but AI models misinterpret?

Figure 1: Original vs. perturbed prompt, showing how a small typo breaks the model.

As shown in Figure 1 above, a simple modification—removing a few characters to create a typo—can cause a sophisticated model’s performance to drop from 100% accuracy to 0%, while the instruction remains comprehensible to a human student. This technique is known as adversarial perturbation.

In this deep dive, we will walk through the researchers’ three-step process: assessing how well LLMs currently cheat, designing “attacks” to confuse them, and running a real-world user study to see if these tricks actually stop students.

The “Blackbox” Problem

Before we get into the methods, we need to understand the constraints educators face. Teachers generally operate in a “Blackbox” setting regarding LLMs. They cannot change the model’s internal weights, nor can they control the company’s updates or safeguards. The only thing an instructor has total control over is the input—the assignment prompt itself.

The core research question is: How can instructors modify assignment prompts to make them less amenable to LLM-based solutions without impacting their understandability to students?

The researchers approached this systematically via three phases:

  1. Baseline Measurement: How good are current LLMs at university-level assignments?
  2. Perturbation Design: Developing techniques to “poison” the prompt.
  3. Field Experiment: Testing these perturbed prompts with actual students.

Diagram showing the three phases: Checking Performance, Perturbation, and User Study.

Step 1: Measuring the Baseline

You cannot disrupt a system effectively until you understand its capabilities. The researchers collected 58 programming assignments from the University of Arizona’s CS1 and CS2 courses. They categorized these into:

  • Short Problems: Single functions or classes.
  • Long Problems: Complex tasks involving interactions across multiple functions or classes.

They tested five major models: CodeRL, Code Llama, Mistral, GPT-3.5 (ChatGPT), and GitHub Copilot.
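
The post doesn’t spell out how “correctness” was graded, but the standard approach is to run each generated solution against the instructor’s unit tests. Here is a minimal sketch of such a harness (my own illustration, with hypothetical file names and pytest-style tests assumed):

```python
import pathlib
import subprocess
import tempfile

def passes_tests(generated_code: str, test_code: str, timeout: int = 30) -> bool:
    """Hypothetical grading harness: write the LLM-generated solution and the
    instructor's tests into a temporary directory, then run pytest on them."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        (tmp_path / "solution.py").write_text(generated_code)
        (tmp_path / "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", str(tmp_path)],
            capture_output=True,
            timeout=timeout,
        )
    # pytest exits with code 0 only if every test passed.
    return result.returncode == 0
```

A model’s “correctness score” is then simply how often this kind of check succeeds across the assignment set.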

The Unexpected Failure of CS1

Surprisingly, the LLMs struggled significantly with the very first introductory course (CS1). In fact, models like GPT-3.5 and CodeRL scored 0% on many CS1 problems.

Why? It turns out that introductory assignments often rely on specific, non-textual contexts that confuse text-based models. Many assignments involved ASCII art or graphical outputs defined by visual patterns rather than pure logic.

Figure 5: Terminal interface showing ASCII art of the Eiffel Tower.

As seen in Figure 5 above, a problem asking a student to print an ASCII Eiffel Tower requires spatial reasoning that is difficult for an LLM to infer solely from text instructions. Additionally, strict requirements on filenames and class structures in these specific assignments tripped up the models.
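
To get a feel for why this is hard for a text-only model, consider a toy exercise in the same spirit (an illustrative stand-in, not one of the actual course assignments): the specification is essentially “make the output look like this picture,” so the constraints are spatial rather than logical.

```python
def print_tower(height: int) -> None:
    """Print a centered ASCII 'tower' (a simple pyramid).

    Toy example: the requirement is visual alignment, which a student checks
    by eye but a text-only model has to infer from a sample output.
    """
    width = 2 * height - 1                      # width of the bottom row
    for row in range(1, height + 1):
        print(("*" * (2 * row - 1)).center(width))

print_tower(4)
# Output (ignoring trailing spaces):
#    *
#   ***
#  *****
# *******
```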

CS2: The Real Threat

However, the results for the second course (CS2) were much more concerning for educators. As the problems became more about data structures and algorithms (standard CS topics), the models performed much better.

Table 1: LLM performance statistics; Copilot performs best.

Table 1 highlights that GitHub Copilot was the strongest performer, solving over 51% of short problems correctly. While the models struggled with “long” problems (which require maintaining context across a larger codebase), they were competent enough to be a viable cheating tool for a significant portion of the curriculum. This established the baseline: the threat is real, particularly for standard algorithmic problems.

Step 2: The Core Method - Designing Perturbations

This is the heart of the paper. The researchers needed to engineer prompt modifications that would degrade LLM performance (efficacy) while keeping the edit distance (the amount of change) low so the text remains readable for students.

The Strategy: Explainability-Guided Attacks

To make the attacks smart, the researchers didn’t just delete random words. They used a technique called SHAP (SHapley Additive exPlanations).

In machine learning, SHAP values tell us which features (in this case, words or tokens) contributed most to the model’s prediction. By running a surrogate model (CodeRL) on the assignments, the researchers could identify the specific “load-bearing” words—the keywords the AI relied on most to generate the code.
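
The paper’s pipeline uses SHAP values computed against CodeRL as a surrogate. As a rough intuition for what “token importance” means, here is a much simpler leave-one-out sketch (my own illustration, not the authors’ method): remove each word in turn, re-score the model’s output, and treat the words whose removal hurts the score most as the load-bearing ones. The `score_with_model` callable is a hypothetical stand-in for “generate code from this prompt and grade it.”

```python
def token_importance(prompt, score_with_model):
    """Crude leave-one-out importance scores for the words in a prompt.

    score_with_model(prompt) -> float is a hypothetical callable that
    generates code for the prompt and returns its graded correctness.
    """
    baseline = score_with_model(prompt)
    words = prompt.split()
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        drop = baseline - score_with_model(ablated)
        scores.append((drop, words[i]))
    # The words whose removal causes the largest drop are the ones the
    # model leans on most -- prime targets for perturbation.
    return sorted(scores, reverse=True)
```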

Once the high-value tokens were identified, the researchers applied several perturbation strategies:

  1. Visual Adversarial Examples (Unicode): Replacing standard characters with “lookalikes” from the Unicode set. For example, replacing a Latin a with a Cyrillic а. To a human, they look identical. To an LLM tokenizer, they are completely different data points (see the sketch after this list).
  2. Synonym Substitution: Replacing key verbs or nouns with synonyms (e.g., changing “calculate” to “compute” or “tally”).
  3. Typos and Deletions: Removing a character from a critical token (e.g., grid_get_height becomes gri_ge_heigt).
  4. Sentence Removal: Deleting entire sentences of context, on the assumption that the removed information was either inferable by a human reader or redundant for humans yet vital for the AI.
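
The first and third strategies are easy to automate once the target tokens are known. Below is a minimal sketch of both (my own illustration, not the paper’s tooling), using a handful of Latin-to-Cyrillic homoglyphs and single-character deletion:

```python
import random

# A few Latin letters that have visually identical Cyrillic counterparts.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}

def unicode_swap(text: str, targets: list[str]) -> str:
    """Strategy 1: replace characters inside the target tokens with lookalikes."""
    for token in targets:
        lookalike = "".join(HOMOGLYPHS.get(ch, ch) for ch in token)
        text = text.replace(token, lookalike)
    return text

def drop_char(text: str, targets: list[str], rng=random) -> str:
    """Strategy 3: delete one character from each target token."""
    for token in targets:
        if len(token) > 1:
            i = rng.randrange(len(token))
            text = text.replace(token, token[:i] + token[i + 1:])
    return text

prompt = "Write a function grid_get_height(grid) that returns the number of rows."
print(unicode_swap(prompt, ["grid_get_height"]))  # looks unchanged on screen
print(drop_char(prompt, ["grid_get_height"]))     # e.g. ...grid_get_heght(grid)...
```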

Visualizing the Attack

The Unicode attack is particularly stealthy. In the example below (Figure 7), the prompt looks standard to the naked eye. However, specific characters in function names like DListNode have been swapped for lookalikes.

Figure 7: Side-by-side comparison showing Unicode replacement in a prompt.

Because the LLM treats the prompt as a sequence of tokens, these “typos” (from the machine’s perspective) break the connection between the instruction and the model’s internal knowledge base of programming concepts.
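
You can confirm with nothing but the Python standard library that the swapped characters really are different data, even though they render identically. (Exactly how the prompt is split into tokens depends on the model’s tokenizer, but anything downstream of the raw bytes sees two different strings.)

```python
import unicodedata

latin, cyrillic = "a", "\u0430"

print(latin == cyrillic)           # False
print(latin.encode("utf-8"))       # b'a'
print(cyrillic.encode("utf-8"))    # b'\xd0\xb0'
print(unicodedata.name(latin))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A

# So a function name like DListNode with one swapped letter becomes, to the
# tokenizer, a different identifier from the one it expects.
```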

Measuring Efficacy

The researchers defined a specific metric for success called Efficacy. It measures how much the correctness score drops after the perturbation is applied.

Formula for Efficacy.

If a model scored 100% originally, and 0% after the prompt was tweaked, the efficacy is 100%. If the score didn’t change, efficacy is 0%.
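
Read this way, efficacy is the relative drop in correctness. A minimal encoding of that definition (my phrasing of the metric as described above, not code from the paper):

```python
def efficacy(score_original: float, score_perturbed: float) -> float:
    """Relative drop in correctness caused by a perturbation, as a percentage.

    100.0 -> the perturbation wiped out a previously non-zero score entirely;
      0.0 -> the perturbation changed nothing.
    """
    if score_original == 0:
        return 0.0  # nothing to degrade
    return 100.0 * (score_original - score_perturbed) / score_original

print(efficacy(100, 0))    # 100.0
print(efficacy(100, 100))  # 0.0
```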

Results of the Perturbations

The automated tests showed that these techniques were highly effective.

Table 2: Efficacy scores by perturbation; Prompt (unicode) and Sentences (remove) rank highest.

As shown in Table 2, techniques like “Prompt (unicode)” (replacing characters throughout the whole text) and “Sentences (remove)” devastated the models’ ability to answer, often achieving high efficacy.

However, there is a trade-off. We can break the AI by deleting half the assignment or replacing every letter with a symbol, but that makes it unreadable for the student. The researchers measured this using Edit Distance—a calculation of how different the new text is from the original.

Figure 3: Bar chart of edit distance by perturbation; Prompt (unicode) is very high.

Figure 3 illustrates this risk. While “Prompt (unicode)” works well, it changes over 50% of the file’s characters, increasing the risk that a student (or a spell-checker) will notice something is wrong. The ideal perturbation is one with high efficacy but low edit distance (like targeted character removal or synonym swapping).
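
Edit distance here is the classic Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn the original prompt into the perturbed one. A compact dynamic-programming version, normalized so it can be read as a fraction of the prompt that changed (the paper’s exact normalization isn’t given in this summary):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def fraction_changed(original: str, perturbed: str) -> float:
    """Edit distance as a fraction of the original prompt's length."""
    return levenshtein(original, perturbed) / max(len(original), 1)
```

In these terms, the sweet spot the authors are after is a perturbation that maximizes efficacy while keeping this fraction small.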

Step 3: The Field Experiment (User Study)

Theory is one thing; reality is another. Even if an LLM fails an automated test, a clever student using ChatGPT might be able to spot the typo, fix it, and get the answer anyway. To test this, the authors recruited 30 undergraduate students who had already completed CS1/CS2.

The Setup:

  • Students were paid to act as “cheaters.”
  • They were given assignments (some original, some perturbed) and told to use ChatGPT to solve them.
  • They had to submit the chat logs and a survey about whether they noticed anything unusual.

The Human Factor Results

The user study revealed fascinating dynamics between human intuition and AI fragility.

1. Overall Drop in Cheating Success

The perturbations worked. The average correctness score of the generated solutions dropped significantly. When combining the best perturbation techniques, the researchers managed to reduce the average correctness score by 77%.

Bar chart showing score drops across models.

2. Detectability vs. Utility

The study found a strong correlation between stealth and success.

  • High-Change Perturbations: When the whole prompt was flooded with Unicode characters, students noticed immediately. They described the text as “unusual” or “weird.”
  • Subtle Perturbations: Techniques like removing a single character or swapping a token were harder to spot.

Crucially, the researchers found that detectability does not equal reversibility. Even when students noticed something was wrong (e.g., “The function name looks spelled wrong”), they often didn’t know how to fix the prompt to make ChatGPT understand it, or they trusted the AI’s hallucinated output anyway.

3. Student Handling Strategies

When faced with a broken prompt, what did students do?

Figure 10: Stacked bar chart showing student handling strategies.

As Figure 10 shows, the most common behaviors were:

  • No unusualness found: They blindly copied the text (success for the attacker!).
  • Update problem statement: They tried to manually fix the typos in the chat.
  • Expected to be bypassed: They hoped the AI would figure it out.

The “Prompt (unicode)” strategy was noticed most often, leading students to try and sanitize the text. However, subtle “Character (remove)” perturbations often slipped through, causing the model to generate subtly broken code that the student might assume is correct—until they try to run it.

Conclusion & Implications

This research highlights a pivotal shift in how we think about AI in education. Rather than engaging in an arms race of “AI detection” (which looks for AI fingerprints in student work), educators can engage in adversarial defense—designing assignments that are inherently resistant to AI automation.

The key takeaways are:

  1. LLMs are fragile: They rely on specific token patterns. Breaking those patterns with typos, synonyms, or invisible characters can shut down their reasoning capabilities.
  2. Visuals matter: Assignments that rely on ASCII art, images, or non-textual logic are naturally resistant to text-based LLMs (though multimodal models like GPT-4V are changing this).
  3. The Human-in-the-loop: Students can sometimes repair these attacks, but it requires effort and understanding—which, ironically, forces them to engage with the material, partially achieving the learning objective.

While models will continue to evolve and likely become robust against simple Unicode swaps or typos, this paper provides a proof of concept. The future of homework might not just be about what questions we ask, but how we ask them to ensure that the student, and not the machine, is doing the thinking.