Fashion is an intrinsic part of human culture, serving as a shield against the elements and a canvas for self-expression. However, the backend of the fashion industry—specifically the creation of sewing patterns—remains a surprisingly manual and technical bottleneck. While generative AI has revolutionized 2D image creation (think Midjourney or DALL-E), generating manufacturable garments is a different beast entirely.
A sewing pattern isn’t just a picture of a dress; it is a complex set of 2D panels with precise geometric relationships that must stitch together to form a 3D shape. To date, AI models for fashion have been “single-modal,” meaning they could perhaps turn a 3D scan into a pattern, or text into a pattern, but they lacked the flexibility to understand images, text, and geometry simultaneously.
Enter AIpparel, a new research contribution from Stanford University and ETH Zürich. AIpparel represents a significant leap forward: it is a multimodal foundation model capable of generating and editing sewing patterns using text, images, or both.

In this post, we will deconstruct how the researchers built AIpparel, the novel “language” they created to teach an AI how to sew, and the impressive results that outperform existing state-of-the-art methods.
The Challenge: Why Sewing Patterns Are Hard for AI
To understand why this paper is significant, we first need to understand the data. A sewing pattern consists of flat 2D shapes (panels) and instructions on how to stitch edges together. Designing these requires a mental mapping between 2D geometry and 3D draping physics.
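To make this concrete, here is a minimal sketch of how such a pattern could be represented in code. The class and field names are illustrative choices, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """One edge of a panel: a straight line or a curve in the 2D panel plane."""
    start: tuple[float, float]                    # (x, y) of the edge's start vertex
    end: tuple[float, float]                      # (x, y) of the edge's end vertex
    control: tuple[float, float] | None = None    # optional Bézier control point

@dataclass
class Panel:
    """A flat 2D piece of fabric, plus its placement around the body."""
    name: str                                     # e.g. "front_bodice" (hypothetical)
    edges: list[Edge]                             # closed loop of edges
    translation: tuple[float, float, float] = (0.0, 0.0, 0.0)          # 3D placement
    rotation: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 1.0) # quaternion

@dataclass
class SewingPattern:
    """A garment: a set of panels and the stitches that join their edges."""
    panels: list[Panel]
    # Each stitch joins (panel_index, edge_index) to (panel_index, edge_index).
    stitches: list[tuple[tuple[int, int], tuple[int, int]]] = field(default_factory=list)
```

The key point is that the stitches are part of the data: two panels are only a garment if the model also gets their edge correspondences right.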
Previous attempts to automate this typically utilized single-modal approaches:
- Image-to-Garment: Trying to guess the pattern from a photo.
- Text-to-Garment: Generating a pattern from a description.
These methods often struggled with the complexity of real-world garments. Furthermore, training a “smart” model requires massive amounts of data. While the internet is full of images, there is no large-scale, public repository of multimodal sewing pattern data (e.g., a pattern paired with a photo, a text description, and editing instructions).
The researchers faced two main hurdles:
- Data Scarcity: No large-scale multimodal sewing dataset existed.
- Representation: How do you feed a complex geometric sewing pattern into a Large Language Model (LLM) that is designed to process text?
The Foundation: The GCD-MM Dataset
Before building the model, the researchers had to build the data. They extended an existing dataset called GarmentCodeData (GCD) to create GCD-MM (GarmentCodeData-Multimodal).
This is currently the largest multimodal sewing pattern dataset available, containing over 120,000 unique garments. Crucially, the researchers didn’t just collect patterns; they annotated them extensively.

As shown in the table above, previous datasets either lacked text descriptions or editing instructions. The GCD-MM dataset includes:
- Text Descriptions: Detailed captions generated using a rule-based system refined by GPT-4o to ensure accuracy (avoiding hallucinations common in standard captioning).
- Images: Rendered views of the draped garments.
- Editing Instructions: Pairs of patterns representing a “before” and “after” state (e.g., a dress before and after adding a hood), along with the text instruction for that edit.
This rich data environment allowed the model to learn the relationships between visual looks, textual descriptions, and the underlying geometric structures.
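For intuition, a single sample in such a dataset could be organized roughly like the dictionary below. The field names and file names are hypothetical, chosen to mirror the annotation types listed above rather than the actual GCD-MM schema.

```python
# Hypothetical structure of one multimodal training sample; the real
# GCD-MM files may use different field names and layouts.
sample = {
    "pattern": "dress_0421_specification.json",   # ground-truth sewing pattern
    "renders": ["dress_0421_front.png",           # images of the draped garment
                "dress_0421_back.png"],
    "caption": "A sleeveless knee-length dress with a fitted waist and a flared skirt.",
    "edit": {
        "instruction": "Add a hood to the dress.",
        "source_pattern": "dress_0421_specification.json",
        "target_pattern": "dress_0421_hooded_specification.json",
    },
}
```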
The Method: Teaching an LLM to “Speak” Sewing
The core innovation of AIpparel is how it retargets a Large Multimodal Model (LMM) to understand and generate sewing patterns. The researchers chose LLaVA 1.5-7B as their base model. LLaVA is already capable of understanding images and text, but it has no inherent knowledge of vector graphics or sewing patterns.
To fix this, the authors developed a novel tokenization scheme.
1. Treating Patterns as Language
LLMs work by predicting the next “token” (word or sub-word) in a sequence. The researchers converted sewing patterns into a sequence of tokens that acts like a drawing command script.
They introduced special tokens to structure the data:
- <SoG> / <EoG>: Start/End of Garment.
- <SoP> / <EoP>: Start/End of Panel.
- Curve types: <L> (Line), <Q> (Quadratic Bézier), <B> (Cubic Bézier), <A> (Arc).
- Stitching tags: Tokens like <t1>, <t2> that indicate which edges are sewn together.
A panel is represented as a sequence of these tokens. For example, a simple panel might look like this code sequence:
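The post shows the actual example as a figure; as a rough, purely illustrative stand-in (the token spelling and parameter layout below are schematic, not the paper's verbatim format), a simple rectangular panel could be encoded along these lines:

```python
# Illustrative only: a rectangular panel written out as a token sequence.
# Numeric parameters are shown inline for readability; in AIpparel they are
# predicted by regression heads rather than spelled out as text tokens.
simple_panel_sequence = [
    "<SoG>",
    "<SoP>",                              # begin panel (e.g., a skirt front)
    ("<L>", (0.0, 0.0), "<t1>"),          # straight edge, stitched via tag <t1>
    ("<L>", (30.0, 0.0)),                 # straight hem edge, not stitched
    ("<L>", (30.0, 60.0), "<t2>"),        # straight edge, stitched via tag <t2>
    ("<Q>", (0.0, 60.0), (15.0, 65.0)),   # quadratic Bézier waistline (endpoint, control point)
    "<EoP>",
    "<EoG>",
]
```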

The mathematical formulation for the entire garment tokenization is represented as:
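The equation itself appears as an image in the original post; in notation consistent with the token list above (the symbols here are a paraphrase, not necessarily the paper's), the idea can be sketched as:

$$
G = \big(\langle \mathrm{SoG} \rangle,\ P_1,\ \dots,\ P_N,\ \langle \mathrm{EoG} \rangle\big),
\qquad
P_i = \big(\langle \mathrm{SoP} \rangle,\ e_{i,1},\ \dots,\ e_{i,M_i},\ \langle \mathrm{EoP} \rangle\big),
$$

where each edge $e_{i,j}$ pairs a discrete type token from $\{\langle L\rangle, \langle Q\rangle, \langle B\rangle, \langle A\rangle\}$ (plus an optional stitch tag $\langle t_k\rangle$) with continuous parameters such as vertex and control-point coordinates.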

2. Hybrid Architecture: Discrete Tokens & Continuous Numbers
Standard LLMs struggle with precise numbers (continuous parameters). If you ask an LLM to predict a vertex coordinate like 14.523, it usually treats it as text, which leads to precision errors. In sewing, a millimeter difference can ruin the fit.
To solve this, AIpparel uses a hybrid approach combining Classification and Regression.

As illustrated in the architecture diagram above:
- The Transformer predicts the discrete tokens (e.g., “Draw a Line”, “Start a Sleeve”).
- Regression Heads (small neural networks attached to the transformer) use the hidden state of those tokens to predict the continuous parameters (e.g., the exact X,Y coordinates of the line’s endpoint, or the 3D rotation of the panel).
This means the model says “Draw a Line here” (Discrete) and “Here are the exact coordinates” (Continuous) simultaneously.
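A minimal sketch of this hybrid head in PyTorch is shown below. Module names and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridPatternHead(nn.Module):
    """Predicts a discrete pattern token and, from the same hidden state,
    regresses the continuous parameters attached to that token.
    A sketch of the idea only, not AIpparel's actual code."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_params: int = 4):
        super().__init__()
        self.token_head = nn.Linear(hidden_dim, vocab_size)  # classification over tokens
        self.param_head = nn.Sequential(                      # regression for coordinates etc.
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_params),
        )

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim) from the transformer backbone
        token_logits = self.token_head(hidden)   # "which token comes next?"
        params = self.param_head(hidden)         # "what are its exact parameters?"
        return token_logits, params
```

Because both heads read the same hidden state, each regressed coordinate stays tied to the specific token it parameterizes.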
3. The Training Objective
The model is trained using a composite loss function. It simultaneously tries to minimize the error in predicting the correct token (Cross-Entropy Loss) and the error in the geometric coordinates (L2 Loss).
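Written out, a composite objective of this kind takes roughly the following form (a generic sketch; the paper's exact weighting and notation may differ):

$$
\mathcal{L} = \underbrace{-\sum_{t} \log p_\theta\big(s_t \mid s_{<t}, \mathbf{c}\big)}_{\text{cross-entropy over discrete tokens}}
\ +\ \lambda\, \underbrace{\sum_{t} \big\lVert \hat{\mathbf{x}}_t - \mathbf{x}_t \big\rVert_2^2}_{\text{L2 on continuous parameters}},
$$

where $s_t$ are the pattern tokens, $\mathbf{c}$ is the multimodal conditioning (image and/or text), $\mathbf{x}_t$ and $\hat{\mathbf{x}}_t$ are the ground-truth and predicted continuous parameters, and $\lambda$ balances the two terms.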

This training strategy ensures that the generated patterns are not only syntactically correct (valid files) but also geometrically precise (valid clothes).
Experiments and Results
The researchers evaluated AIpparel across several tasks, comparing it against state-of-the-art baselines like SewFormer (specialized for images) and DressCode (specialized for text).
Task 1: Image-to-Garment Prediction
In this task, the model is given an image of a garment and must generate the corresponding sewing pattern.

The qualitative results in Figure 3 are striking. Notice the “Draping Failed” notes on the SewFormer baseline. This indicates that the baseline generated disjointed or invalid panels that couldn’t even be simulated. AIpparel, conversely, generated simulation-ready patterns that closely matched the visual input, handling details like waistbands and sleeve cuffs.
Task 2: Multimodal Generation
One of the unique strengths of AIpparel is its ability to handle multimodal prompts—for example, giving the model a text description and a reference image simultaneously.

In the comparison above, baselines (which are single-modal models augmented with GPT-4 or DALL-E adapters) struggle to merge the conflicting or complementary information. AIpparel, trained natively on multimodal data, produces a garment that respects the visual structure of the image while incorporating the specific design details requested in the text.
The quantitative data supports this visual evidence:

Task 3: Language-Instructed Editing
Perhaps the most useful application for designers is editing. A user can input an existing sewing pattern and a text instruction like “make the skirt longer” or “add a hood.”

As shown in Figure 5, AIpparel successfully integrates a hood into a tank top pattern (top row) and elongates a skirt (bottom row). The baseline models often hallucinate completely new garments rather than modifying the existing one because they lack a deep understanding of the pattern’s structural “syntax.”
Ablation Study: Why the New Tokenizer Matters
The authors also tested whether their new tokenization scheme (using regression heads for numbers) was actually better than the standard approach of discretizing numbers into text tokens (used by DressCode).

The difference is clear. Purely discrete tokenizers (DressCode) struggle with smooth curves and precise symmetries, resulting in artifacts. AIpparel’s hybrid approach maintains the geometric integrity of the vector graphics.
Conclusion
AIpparel establishes a new benchmark for generative fashion. By treating sewing patterns as a language and training a Large Multimodal Model on a massive, annotated dataset, the researchers have created a system that can design manufacturable garments from vague descriptions, images, or specific editing commands.
Key Takeaways:
- Multimodality is Key: Combining text, image, and geometric data leads to a far deeper understanding than any single-modal approach.
- Hybrid Tokenization: For engineering tasks involving precise dimensions (like sewing or CAD), standard LLM tokenization is insufficient. The use of regression heads to handle continuous parameters is a powerful technique.
- Data-Driven Design: The creation of the GCD-MM dataset is a contribution in itself, enabling future research in this domain.
While limitations remain—such as the difficulty of handling non-manifold geometries like pockets—AIpparel paves the way for a future where designing a custom, fitted garment is as easy as typing a prompt or uploading a sketch. This bridges the gap between the “digital knowledge” of the web and the “physical reality” of clothing fabrication.