Fashion is an intrinsic part of human culture, serving as a shield against the elements and a canvas for self-expression. However, the backend of the fashion industry—specifically the creation of sewing patterns—remains a surprisingly manual and technical bottleneck. While generative AI has revolutionized 2D image creation (think Midjourney or DALL-E), generating manufacturable garments is a different beast entirely.
A sewing pattern isn’t just a picture of a dress; it is a complex set of 2D panels with precise geometric relationships that must stitch together to form a 3D shape. To date, AI models for fashion have been “single-modal,” meaning they could perhaps turn a 3D scan into a pattern, or text into a pattern, but they lacked the flexibility to understand images, text, and geometry simultaneously.
Enter AIpparel, a new research contribution from Stanford University and ETH Zürich. AIpparel represents a significant leap forward: it is a multimodal foundation model capable of generating and editing sewing patterns using text, images, or both.

In this post, we will deconstruct how the researchers built AIpparel, the novel “language” they created to teach an AI how to sew, and the impressive results that outperform existing state-of-the-art methods.
The Challenge: Why Sewing Patterns Are Hard for AI
To understand why this paper is significant, we first need to understand the data. A sewing pattern consists of flat 2D shapes (panels) and instructions on how to stitch edges together. Designing these requires a mental mapping between 2D geometry and 3D draping physics.
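To make this concrete, here is a minimal sketch of how such a pattern could be represented in code. The class and field names are illustrative choices, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """One edge of a panel: a straight line or a curve in the 2D panel plane."""
    start: tuple[float, float]                    # (x, y) of the edge's start vertex
    end: tuple[float, float]                      # (x, y) of the edge's end vertex
    control: tuple[float, float] | None = None    # optional Bézier control point

@dataclass
class Panel:
    """A flat 2D piece of fabric, plus its placement around the body."""
    name: str                                     # e.g. "front_bodice" (hypothetical)
    edges: list[Edge]                             # closed loop of edges
    translation: tuple[float, float, float] = (0.0, 0.0, 0.0)          # 3D placement
    rotation: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 1.0) # quaternion

@dataclass
class SewingPattern:
    """A garment: a set of panels and the stitches that join their edges."""
    panels: list[Panel]
    # Each stitch joins (panel_index, edge_index) to (panel_index, edge_index).
    stitches: list[tuple[tuple[int, int], tuple[int, int]]] = field(default_factory=list)
```

The key point is that the stitches are part of the data: two panels are only a garment if the model also gets their edge correspondences right.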
Previous attempts to automate this typically utilized single-modal approaches:
- Image-to-Garment: Trying to guess the pattern from a photo.
- Text-to-Garment: Generating a pattern from a description.
These methods often struggled with the complexity of real-world garments. Furthermore, training a “smart” model requires massive amounts of data. While the internet is full of images, there is no large-scale, public repository of multimodal sewing pattern data (e.g., a pattern paired with a photo, a text description, and editing instructions).
The researchers faced two main hurdles:
- Data Scarcity: No large-scale multimodal sewing dataset existed.
- Representation: How do you feed a complex geometric sewing pattern into a Large Language Model (LLM) that is designed to process text?
The Foundation: The GCD-MM Dataset
Before building the model, the researchers had to build the data. They extended an existing dataset called GarmentCodeData (GCD) to create GCD-MM (GarmentCodeData-Multimodal).
This is currently the largest multimodal sewing pattern dataset available, containing over 120,000 unique garments. Crucially, the researchers didn’t just collect patterns; they annotated them extensively.

As shown in the table above, previous datasets either lacked text descriptions or editing instructions. The GCD-MM dataset includes:
- Text Descriptions: Detailed captions generated using a rule-based system refined by GPT-4o to ensure accuracy (avoiding hallucinations common in standard captioning).
- Images: Rendered views of the draped garments.
- Editing Instructions: Pairs of patterns representing a “before” and “after” state (e.g., a dress before and after adding a hood), along with the text instruction for that edit.
This rich data environment allowed the model to learn the relationships between visual looks, textual descriptions, and the underlying geometric structures.
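For intuition, a single sample in such a dataset could be organized roughly like the dictionary below. The field names and file names are hypothetical, chosen to mirror the annotation types listed above rather than the actual GCD-MM schema.

```python
# Hypothetical structure of one multimodal training sample; the real
# GCD-MM files may use different field names and layouts.
sample = {
    "pattern": "dress_0421_specification.json",   # ground-truth sewing pattern
    "renders": ["dress_0421_front.png",           # images of the draped garment
                "dress_0421_back.png"],
    "caption": "A sleeveless knee-length dress with a fitted waist and a flared skirt.",
    "edit": {
        "instruction": "Add a hood to the dress.",
        "source_pattern": "dress_0421_specification.json",
        "target_pattern": "dress_0421_hooded_specification.json",
    },
}
```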
The Method: Teaching an LLM to “Speak” Sewing
The core innovation of AIpparel is how it retargets a Large Multimodal Model (LMM) to understand and generate sewing patterns. The researchers chose LLaVA 1.5-7B as their base model. LLaVA is already capable of understanding images and text, but it has no inherent knowledge of vector graphics or sewing patterns.
To fix this, the authors developed a novel tokenization scheme.
1. Treating Patterns as Language
LLMs work by predicting the next “token” (word or sub-word) in a sequence. The researchers converted sewing patterns into a sequence of tokens that acts like a drawing command script.
They introduced special tokens to structure the data:
- <SoG> / <EoG>: Start/End of Garment.
- <SoP> / <EoP>: Start/End of Panel.
- Curve types: <L> (Line), <Q> (Quadratic Bézier), <B> (Cubic Bézier), <A> (Arc).
- Stitching tags: Tokens like <t1>, <t2> that indicate which edges are sewn together.
A panel is represented as a sequence of these tokens. For example, a simple panel might look like this code sequence:
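The post shows the actual example as a figure; as a rough, purely illustrative stand-in (the token spelling and parameter layout below are schematic, not the paper's verbatim format), a simple rectangular panel could be encoded along these lines:

```python
# Illustrative only: a rectangular panel written out as a token sequence.
# Numeric parameters are shown inline for readability; in AIpparel they are
# predicted by regression heads rather than spelled out as text tokens.
simple_panel_sequence = [
    "<SoG>",
    "<SoP>",                              # begin panel (e.g., a skirt front)
    ("<L>", (0.0, 0.0), "<t1>"),          # straight edge, stitched via tag <t1>
    ("<L>", (30.0, 0.0)),                 # straight hem edge, not stitched
    ("<L>", (30.0, 60.0), "<t2>"),        # straight edge, stitched via tag <t2>
    ("<Q>", (0.0, 60.0), (15.0, 65.0)),   # quadratic Bézier waistline (endpoint, control point)
    "<EoP>",
    "<EoG>",
]
```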

The mathematical formulation for the entire garment tokenization is represented as:
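The equation itself appears as an image in the original post; in notation consistent with the token list above (the symbols here are a paraphrase, not necessarily the paper's), the idea can be sketched as:

$$
G = \big(\langle \mathrm{SoG} \rangle,\ P_1,\ \dots,\ P_N,\ \langle \mathrm{EoG} \rangle\big),
\qquad
P_i = \big(\langle \mathrm{SoP} \rangle,\ e_{i,1},\ \dots,\ e_{i,M_i},\ \langle \mathrm{EoP} \rangle\big),
$$

where each edge $e_{i,j}$ pairs a discrete type token from $\{\langle L\rangle, \langle Q\rangle, \langle B\rangle, \langle A\rangle\}$ (plus an optional stitch tag $\langle t_k\rangle$) with continuous parameters such as vertex and control-point coordinates.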

2. Hybrid Architecture: Discrete Tokens & Continuous Numbers
Standard LLMs struggle with precise numbers (continuous parameters). If you ask an LLM to predict a vertex coordinate like 14.523, it usually treats it as text, which leads to precision errors. In sewing, a millimeter difference can ruin the fit.
To solve this, AIpparel uses a hybrid approach combining Classification and Regression.

As illustrated in the architecture diagram above:
- The Transformer predicts the discrete tokens (e.g., “Draw a Line”, “Start a Sleeve”).
- Regression Heads (small neural networks attached to the transformer) use the hidden state of those tokens to predict the continuous parameters (e.g., the exact X,Y coordinates of the line’s endpoint, or the 3D rotation of the panel).
This means the model says “Draw a Line here” (Discrete) and “Here are the exact coordinates” (Continuous) simultaneously.
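A minimal sketch of this hybrid head in PyTorch is shown below. Module names and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridPatternHead(nn.Module):
    """Predicts a discrete pattern token and, from the same hidden state,
    regresses the continuous parameters attached to that token.
    A sketch of the idea only, not AIpparel's actual code."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_params: int = 4):
        super().__init__()
        self.token_head = nn.Linear(hidden_dim, vocab_size)  # classification over tokens
        self.param_head = nn.Sequential(                      # regression for coordinates etc.
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_params),
        )

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim) from the transformer backbone
        token_logits = self.token_head(hidden)   # "which token comes next?"
        params = self.param_head(hidden)         # "what are its exact parameters?"
        return token_logits, params
```

Because both heads read the same hidden state, each regressed coordinate stays tied to the specific token it parameterizes.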
3. The Training Objective
The model is trained using a composite loss function. It simultaneously tries to minimize the error in predicting the correct token (Cross-Entropy Loss) and the error in the geometric coordinates (L2 Loss).
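Written out, a composite objective of this kind takes roughly the following form (a generic sketch; the paper's exact weighting and notation may differ):

$$
\mathcal{L} = \underbrace{-\sum_{t} \log p_\theta\big(s_t \mid s_{<t}, \mathbf{c}\big)}_{\text{cross-entropy over discrete tokens}}
\ +\ \lambda\, \underbrace{\sum_{t} \big\lVert \hat{\mathbf{x}}_t - \mathbf{x}_t \big\rVert_2^2}_{\text{L2 on continuous parameters}},
$$

where $s_t$ are the pattern tokens, $\mathbf{c}$ is the multimodal conditioning (image and/or text), $\mathbf{x}_t$ and $\hat{\mathbf{x}}_t$ are the ground-truth and predicted continuous parameters, and $\lambda$ balances the two terms.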

This training strategy ensures that the generated patterns are not only syntactically correct (valid files) but also geometrically precise (valid clothes).
Experiments and Results
The researchers evaluated AIpparel across several tasks, comparing it against state-of-the-art baselines like SewFormer (specialized for images) and DressCode (specialized for text).
Task 1: Image-to-Garment Prediction
In this task, the model is given an image of a garment and must generate the corresponding sewing pattern.

The qualitative results in Figure 3 are striking. Notice the “Draping Failed” notes on the SewFormer baseline. This indicates that the baseline generated disjointed or invalid panels that couldn’t even be simulated. AIpparel, conversely, generated simulation-ready patterns that closely matched the visual input, handling details like waistbands and sleeve cuffs.
Task 2: Multimodal Generation
One of the unique strengths of AIpparel is its ability to handle multimodal prompts—for example, giving the model a text description and a reference image simultaneously.

In the comparison above, baselines (which are single-modal models augmented with GPT-4 or DALL-E adapters) struggle to merge the conflicting or complementary information. AIpparel, trained natively on multimodal data, produces a garment that respects the visual structure of the image while incorporating the specific design details requested in the text.
The quantitative data supports this visual evidence:

Task 3: Language-Instructed Editing
Perhaps the most useful application for designers is editing. A user can input an existing sewing pattern and a text instruction like “make the skirt longer” or “add a hood.”

As shown in Figure 5, AIpparel successfully integrates a hood into a tank top pattern (top row) and elongates a skirt (bottom row). The baseline models often hallucinate completely new garments rather than modifying the existing one because they lack a deep understanding of the pattern’s structural “syntax.”
Ablation Study: Why the New Tokenizer Matters
The authors also tested whether their new tokenization scheme (using regression heads for numbers) was actually better than the standard approach of discretizing numbers into text tokens (used by DressCode).

The difference is clear. Purely discrete tokenizers (DressCode) struggle with smooth curves and precise symmetries, resulting in artifacts. AIpparel’s hybrid approach maintains the geometric integrity of the vector graphics.
Conclusion
AIpparel establishes a new benchmark for generative fashion. By treating sewing patterns as a language and training a Large Multimodal Model on a massive, annotated dataset, the researchers have created a system that can design manufacturable garments from vague descriptions, images, or specific editing commands.
Key Takeaways:
- Multimodality is Key: Combining text, image, and geometric data leads to a far deeper understanding than any single-modal approach.
- Hybrid Tokenization: For engineering tasks involving precise dimensions (like sewing or CAD), standard LLM tokenization is insufficient. The use of regression heads to handle continuous parameters is a powerful technique.
- Data-Driven Design: The creation of the GCD-MM dataset is a contribution in itself, enabling future research in this domain.
While limitations remain—such as the difficulty of handling non-manifold geometries like pockets—AIpparel paves the way for a future where designing a custom, fitted garment is as easy as typing a prompt or uploading a sketch. This bridges the gap between the “digital knowledge” of the web and the “physical reality” of clothing fabrication.