If you have ever worked in data science, you know the “80/20 rule”: you spend 80% of your time cleaning and preparing data, and only 20% actually analyzing it or building models.

Data Preprocessing (DP) is the unglamorous backbone of the data pipeline. It involves fixing spelling errors, filling in missing values, matching records from different databases, and standardizing formats. Traditionally, this has been handled by a fragmented ecosystem of specialized tools—one algorithm for error detection, a completely different one for entity matching, and so on.

Recently, Large Language Models (LLMs) like GPT-4 have shown incredible promise as “universal solvers” for these tasks. They can read a messy row of data, understand the context, and fix it. However, this introduces a major obstacle: data privacy.

Sending sensitive enterprise data—customer records, financial transactions, or medical history—to a public API like OpenAI’s is often a non-starter for organizations concerned with data breaches and compliance.

This brings us to Jellyfish, a new research contribution from Osaka University and NEC Corporation. The researchers propose a method to instruction-tune local LLMs (like Mistral and Llama) specifically for data preprocessing. The result is a secure, efficient, and highly capable model that runs on your own GPU, often outperforming traditional methods and rivaling giant cloud-based models.

In this post, we will deconstruct how Jellyfish works, how it transforms raw data into instruction prompts, and why it might be the future of secure data cleaning.

The Landscape of Data Preprocessing

Before understanding Jellyfish, we need to categorize the messiness of real-world data. The paper focuses on four core DP tasks that usually require distinct approaches (toy examples follow the list):

  1. Error Detection (ED): Spotting typos, inconsistencies, or values that don’t make sense (e.g., an age of 200).
  2. Data Imputation (DI): Inferring and filling in missing values based on context.
  3. Schema Matching (SM): Figuring out that the column “DOB” in Database A is the same as “Birth_Date” in Database B.
  4. Entity Matching (EM): Determining that “Apple Inc.” and “Apple Computer Company” refer to the same real-world entity.
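
To make these four tasks concrete, here are toy instances written as Python data; the records and labels are illustrative, not drawn from the paper’s benchmarks.

```python
# Toy instances of the four DP tasks (records and labels are made up for illustration).
error_detection = {"record": {"name": "Jane Doe", "age": 200}, "erroneous_attribute": "age"}

data_imputation = {"record": {"city": "Osaka", "country": None}, "imputed_value": "Japan"}

schema_matching = {"column_a": "DOB", "column_b": "Birth_Date", "match": True}

entity_matching = {
    "record_a": {"name": "Apple Inc.", "hq": "Cupertino"},
    "record_b": {"name": "Apple Computer Company", "hq": "Cupertino, CA"},
    "match": True,
}
```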

Until recently, you needed separate machine learning models for each of these. LLMs changed the game because they possess “general knowledge” and fuzzy reasoning capabilities. However, off-the-shelf local models (like a base Llama 3 model) aren’t naturally good at these structured data tasks, and cloud models (like GPT-4) are privacy nightmares for sensitive data.

The Jellyfish Solution

The researchers introduce Jellyfish, a family of models (7B, 8B, and 13B parameters) built by fine-tuning open-source bases (Mistral, Llama 3, and OpenOrca-Platypus2).

The goal was to create a Universal DP Task Solver that:

  • Runs Locally: Operates on a single, low-priced GPU (consumer-grade hardware), ensuring data never leaves the premises.
  • Is Customizable: Allows users to inject specific domain knowledge.
  • Is Interpretable: Doesn’t just fix the data but explains why.

The Architecture: From Raw Data to Instruction Tuning

The core innovation of Jellyfish isn’t just the model itself, but the pipeline used to create it. Training an LLM to clean data isn’t as simple as feeding it CSV files. The structured data must be “serialized” into natural language instructions that the model can understand and learn from.
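
To make this concrete, here is a minimal sketch of what such serialization might look like. The attribute/value syntax approximates the format shown in the paper’s figures; the function name and exact formatting are illustrative rather than the authors’ code.

```python
def serialize_record(record: dict, label: str = "Record") -> str:
    """Turn a structured row into attribute/value text an LLM can read.

    Missing values are rendered explicitly so the model can reason about them.
    """
    parts = []
    for attribute, value in record.items():
        rendered = value if value not in (None, "") else "N/A"
        parts.append(f'{attribute}: "{rendered}"')
    return f"{label}: [{', '.join(parts)}]"


row = {"name": "Norton Internet Security 2008", "factory": "Symantec", "price": None}
print(serialize_record(row, label="Product A"))
# Product A: [name: "Norton Internet Security 2008", factory: "Symantec", price: "N/A"]
```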

Figure 1: Overview of instruction tuning for data preprocessing

As shown in Figure 1 above, the process involves two main stages: Tuning and Inference.

  1. Tuning (Left): Raw datasets from various tasks (ED, DI, SM, EM) are converted into instruction data. This isn’t just a copy-paste job; the researchers apply Knowledge Injection and Reasoning Data generation (using a larger “teacher” model, Mixtral) to create rich training examples (a rough sketch of one such record follows this list). These are fed into base models to create the Jellyfish variants.
  2. Inference (Right): The tuned Jellyfish model can then handle new, unseen datasets and even unseen tasks (like Column Type Annotation) by following similar prompt structures.
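
As a rough illustration, a single instruction-tuning record could be stored as a Python/JSON object like the one below. The field names are assumptions made for this sketch; the real data ultimately follows whichever chat template the base model expects.

```python
# One hypothetical instruction-tuning record for an entity matching (EM) pair.
# Field names are illustrative; they are not the authors' exact schema.
training_record = {
    "task": "entity_matching",
    "instruction": "Determine whether the two products listed below are the same.",
    "knowledge": "Missing values (N/A) should not be used as a basis for your decision.",
    "input": (
        'Product A: [name: "iPad mini", price: "N/A"]\n'
        'Product B: [name: "Apple iPad mini Wi-Fi", price: "329.00"]'
    ),
    "output": "Yes",                         # gold label from the original dataset
    "reasoning": "Both names refer to ...",  # optional, generated by the Mixtral teacher
}
```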

The Anatomy of a Jellyfish Prompt

How do you translate a database row into a language prompt? The researchers developed a structured serialization method.

Figure 2: Example prompt in instruction data.

Figure 2 illustrates the prompt structure used during training and inference. It consists of five key components:

  1. System Message: Sets the behavior (e.g., “You are an AI assistant…”).
  2. Task Description: Clearly defines what the model needs to do (e.g., “Determine whether two Products listed below are the same”).
  3. Injected Knowledge: This is a crucial innovation. The prompt includes rules or hints, such as “Missing values (N/A) should not be used as a basis for your decision.” This helps the model avoid common pitfalls, like thinking two records match just because they both have “NULL” in the phone number field.
  4. Instance Content: The actual data row(s), serialized into text (e.g., Product A: [name: "..." factory: "..."]).
  5. Question & Output Format: A specific question and constraints on how to answer (e.g., “Choose your answer from: [Yes, No]”).

By standardizing this format, the model learns a consistent pattern for analyzing structured data, regardless of whether it’s looking at restaurant reviews or medical records.
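
Putting the five components together, a minimal sketch of a prompt builder for an entity matching instance might look like this. The wording is paraphrased from the figure; treat the exact strings as assumptions rather than the authors’ verbatim template.

```python
def build_em_prompt(record_a: str, record_b: str, knowledge: list[str]) -> str:
    """Assemble the five prompt components for an entity matching (EM) instance."""
    system_message = "You are an AI assistant that follows instructions extremely well."
    task_description = (
        "Determine whether the two Products listed below refer to the same real-world entity."
    )
    injected_knowledge = "\n".join(f"- {rule}" for rule in knowledge)  # domain hints
    question = "Are Product A and Product B the same? Choose your answer from: [Yes, No]."
    return "\n\n".join([
        system_message,
        task_description,
        injected_knowledge,
        f"{record_a}\n{record_b}",
        question,
    ])


prompt = build_em_prompt(
    'Product A: [name: "Photoshop CS3", factory: "Adobe"]',
    'Product B: [name: "Adobe Photoshop CS3 for Mac", factory: "N/A"]',
    knowledge=["Missing values (N/A) should not be used as a basis for your decision."],
)
```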

The “Teacher-Student” Approach: Distilling Reasoning

One of the most interesting aspects of Jellyfish is how it learns to explain its decisions. Small models (7B parameters) often struggle with complex reasoning. To solve this, the researchers used a technique called Knowledge Distillation.

They took a larger, smarter open-source model (Mixtral-8x7B) and fed it the DP tasks, asking it to generate detailed explanations for the correct answers. These explanations were then treated as “ground truth” reasoning data to train the smaller Jellyfish models.

Essentially, Jellyfish was taught not just what the answer is (Yes/No), but how a smarter model thinks through the problem.
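
Here is a minimal sketch of how such reasoning data could be generated, assuming the teacher (Mixtral-8x7B-Instruct) is served behind an OpenAI-compatible endpoint such as a local vLLM server. The endpoint URL, model id, and prompt wording are assumptions for this sketch, not the authors’ exact setup.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) hosting the teacher model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def distill_reasoning(task_prompt: str, gold_answer: str) -> str:
    """Ask the teacher model to explain a known-correct answer step by step."""
    request = (
        f"{task_prompt}\n\n"
        f"The correct answer is: {gold_answer}.\n"
        "Explain step by step why this answer is correct, "
        "referring only to the attribute values shown above."
    )
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user", "content": request}],
        temperature=0.0,
    )
    # The explanation becomes the "reasoning" target when tuning the student model.
    return response.choices[0].message.content
```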

Experimental Results

The researchers compared Jellyfish against two groups of competitors:

  1. Non-LLM Methods: Traditional machine learning tools specialized for specific tasks (e.g., HoloDetect for errors, Ditto for entity matching).
  2. LLM Methods: General-purpose giants like GPT-3.5, GPT-4, and other fine-tuned table models.

Performance on Core Tasks

The results were highly promising. Despite running on local hardware with a fraction of the parameters of GPT-4, Jellyfish held its own.

Table 4: DP performance on seen tasks.

Table 4 highlights several key findings:

  • Jellyfish vs. Non-LLMs: On almost every dataset, Jellyfish-13B outperforms the specialized non-LLM methods (the “Best of non-LLM” column). This is significant because those non-LLM methods usually require training specifically on that dataset, whereas Jellyfish is acting as a generalist.
  • Jellyfish vs. GPT-4: While GPT-4 remains the king in raw performance for many tasks (indicated by bold numbers), Jellyfish-13B is surprisingly competitive, often beating GPT-3.5 and occasionally even surpassing GPT-4 (such as on the CMS dataset for Schema Matching).
  • The Power of Specialization: Jellyfish-13B consistently outperforms the much larger Solar 70B and Stable Beluga 2 70B (referenced in the broader study), proving that a smaller, specialized model is often better than a massive generalist model for specific domain tasks.

The Impact of Multi-Task Learning

A fascinating question in LLM tuning is: “Does learning to detect errors help the model learn to match entities?” The answer appears to be yes.

Figure 5: Impact of tuning with multi-task data on DP performance.

Figure 5 (above) shows the performance of the 13B model as the training data includes more tasks. The point labeled (1, 1, 1, 1) represents the model trained on all four tasks (ED, DI, SM, EM). The trend is clear: Cross-task learning works.

When the model learns to identify errors (ED), it gains a better understanding of what valid data looks like, which helps it impute missing values (DI). When it learns to match schemas (SM), it understands attribute relationships better, helping it match entities (EM). This synergy confirms that building a “Universal Solver” is a valid strategy.

The Reasoning Paradox

We mentioned earlier that Jellyfish uses “Reasoning Data” (explanations) during training. However, the impact of this data varied depending on the model size.

Figure 4: Impact of reasoning data on DP performance.

Figure 4 reveals an interesting nuance:

  • 7B and 8B Models (Orange/Blue lines): Adding a moderate amount of reasoning data (around 8k-14k instances) improves performance. These smaller models benefit from being “taught how to think.”
  • 13B Model (Green line): Performance initially drops significantly when reasoning data is added.

Why? The researchers hypothesize that the base model for the 13B version (OpenOrca-Platypus2) already had strong logic capabilities that didn’t align well with the specific logic required for DP tasks, causing a conflict during fine-tuning. Consequently, the final Jellyfish-13B model was tuned without reasoning data to maximize task accuracy, while the 7B/8B versions retained it to serve as “interpreters.”

Interpretation: Better than GPT-3.5?

One of the main selling points of Jellyfish is that it can provide detailed justifications for its data cleaning decisions—crucial for auditing.

The researchers conducted a head-to-head comparison where GPT-4o acted as a judge, comparing the explanations generated by Jellyfish-8B vs. GPT-3.5.

Table 13: Head-to-head comparison of GPT-3.5 and Jellyfish-7B/8B.

As shown in Table 13, Jellyfish-8B crushed GPT-3.5, winning nearly 73% of the head-to-head comparisons.

In qualitative examples, while GPT-3.5 would simply state “Prices are different, so products are different,” Jellyfish would provide deeper context, noting nuances like “Product A is a specific software version for Mac, while Product B is a comprehensive suite.” This level of detail is vital when users need to trust an automated system to alter their data.
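
For intuition, a judging step like the one described above could be driven by a prompt along these lines; the wording and the A/B anonymization are assumptions about the protocol, not the paper’s exact judging prompt.

```python
def build_judge_prompt(task: str, explanation_a: str, explanation_b: str) -> str:
    """Ask a judge model to pick the better of two anonymized explanations."""
    return (
        "You are judging two explanations of the same data preprocessing decision.\n\n"
        f"Task: {task}\n\n"
        f"Explanation A:\n{explanation_a}\n\n"
        f"Explanation B:\n{explanation_b}\n\n"
        "Which explanation is more accurate and better grounded in the data shown? "
        "Answer with exactly one of: A, B, Tie."
    )
```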

Why This Matters: Privacy and Customization

The technical achievements of Jellyfish are impressive, but the practical implications are what make this paper vital for the industry.

  1. Data Sovereignty: By proving that a 7B-13B parameter model can handle complex cleaning tasks, Jellyfish enables companies to clean sensitive financial or health data entirely offline. You don’t need an Azure or OpenAI contract to clean your data with LLM-level intelligence.
  2. Cost Efficiency: While tuning requires some compute, local inference costs nothing beyond the hardware you already own. For massive datasets, paying per-token API fees to GPT-4 is prohibitively expensive; Jellyfish runs on a single A100 or even consumer cards (with quantization), making it far more scalable.
  3. Extensibility: Because the method relies on instruction tuning, organizations can clone this approach. A hospital could inject specific medical knowledge into the prompts to create a “Medical-Jellyfish” that understands that a heart rate of 300 is an error, but a cholesterol level of 300 is just high.
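
As a sketch of that extensibility, domain rules could be kept as plain text and prepended to prompts as injected knowledge. The rules and helper below are hypothetical examples, not taken from the paper.

```python
# Hypothetical domain rules a hospital might inject for error detection (ED).
MEDICAL_RULES = [
    "A heart rate above 250 bpm should be flagged as an error.",
    "A total cholesterol of 300 mg/dL is high but plausible; do not flag it.",
    "Missing values (N/A) are not errors by themselves.",
]

def inject_domain_knowledge(base_prompt: str, rules: list[str]) -> str:
    """Prepend domain-specific rules so the model applies them when cleaning data."""
    knowledge_block = "\n".join(f"- {rule}" for rule in rules)
    return f"{knowledge_block}\n\n{base_prompt}"
```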

Conclusion

Jellyfish represents a shift in how we think about applying LLMs to utility tasks. We don’t always need the largest, smartest model in the cloud. For specific, high-stakes steps in the data pipeline like preprocessing, a smaller, locally tuned model can offer better performance, better explanations, and vastly superior security.

By combining clever data serialization, knowledge injection, and reasoning distillation, the authors have created a blueprint for the future of automated data cleaning—one where the “garbage collector” of data science is an intelligent, articulate, and secure AI agent.

If you are interested in trying these models, the researchers have made them available on Hugging Face, allowing anyone with a GPU to start cleaning their data locally today.
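
As a starting point, something like the following should load one of the released models with 4-bit quantization on a single consumer GPU. The model id is assumed from the Hugging Face hub (verify the exact name), and a production setup would normally apply the model’s chat template rather than a bare prompt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model id assumed from the Hugging Face hub; double-check the exact name before use.
model_id = "NECOUDBFM/Jellyfish-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # fits consumer GPUs
    device_map="auto",
)

prompt = (
    "Determine whether the two Products listed below are the same.\n"
    'Product A: [name: "Photoshop CS3", factory: "Adobe"]\n'
    'Product B: [name: "Adobe Photoshop CS3 for Mac", factory: "N/A"]\n'
    "Choose your answer from: [Yes, No]."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```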