Imagine walking into a store and seeing a label that just says “Pen.”
If you are standing in the stationery aisle, you immediately know it’s a writing instrument. But if you are standing in the farming supplies section, that same word—“pen”—likely refers to an enclosure for animals. The word hasn’t changed, but the context has shifted the meaning entirely.
This ambiguity is the arch-nemesis of Machine Translation (MT). For years, Neural Machine Translation (NMT) systems, like the ones powering Google Translate or DeepL, have translated sentences in isolation. They treat each sentence as if it existed in a vacuum, ignoring the visual or categorical world surrounding it. While this works well for generic documents, it often fails in the high-stakes, nuance-heavy world of e-commerce.
Today, we are diving deep into a fascinating research paper titled “ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT.” The researchers tackle the specific challenge of translating product listings from Czech to Polish. They investigate whether showing an AI model a picture of the product, or telling it the product’s category, helps it generate better translations.
In this post, we will explore the limitations of traditional text-only translation, break down the architecture of Multimodal Machine Translation (MMT), and analyze the results of three distinct experiments to see if context really is king.
The Problem: When Text Isn’t Enough
Neural Machine Translation has made massive strides since the introduction of the Transformer architecture in 2017. These models use “attention mechanisms” to understand relationships between words in a sentence. However, standard NMT models operate at the sentence level. They don’t “see” what they are translating.
This is particularly problematic in e-commerce for two reasons:
- Ambiguity: As mentioned with the “pen” example, products often have polysemous names (words with multiple meanings). A “mouse” could be a computer peripheral or a pet toy. A “driver” could be a golf club or a software tool.
- Data Quality: E-commerce text is notoriously messy. Product titles are often just lists of keywords (“Nike Shoes Running Fast Blue Size 10”), and descriptions can be grammatically fragmented. Without visual context, even a human translator might struggle to decipher exactly what is being sold.
To solve this, the field has moved toward Multimodal Machine Translation (MMT). MMT attempts to mimic human understanding by integrating visual information (images) alongside text. If the model sees a picture of a rabbit in a cage, it knows the “pen” is an enclosure, not a writing tool.
However, research in this area has been stalled by a lack of good data. High-quality datasets that align source text, target text, and images are rare, especially for language pairs like Czech and Polish. This is where the ConECT project comes in.
The Foundation: The ConECT Dataset
Before they could train better models, the researchers had to build a better playground. They introduced ConECT (Contextual E-Commerce Translation), a new dataset specifically designed to test context-aware translation.
The dataset focuses on the Czech-to-Polish language pair. These two Slavic languages are related, but they have distinct vocabularies and grammatical structures that can easily trip up an automated system.
The researchers extracted data from two major e-commerce platforms: allegro.pl and mall.cz. They didn’t just scrape text; they curated a rich ecosystem of data points. For every entry, they collected:
- The Source Text (Czech): Product names, descriptions, and offer titles.
- The Target Text (Polish): High-quality translations.
- The Product Image: The main visual representation of the item.
- The Category Path: The hierarchical breadcrumb trail of the product (e.g., “Sports » Bicycles » Tires”).
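To make the structure concrete, here is a minimal sketch of what one aligned record could look like in code. The field names and example values are purely illustrative; they are not the dataset's official schema or file format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConectRecord:
    """One aligned example (illustrative field names, not the official schema)."""
    source_cs: str            # Czech source text: offer title, product name, or description
    target_pl: str            # Polish reference translation
    image_path: str           # path or URL of the main product image
    category_path: List[str]  # hierarchical category breadcrumb

example = ConectRecord(
    source_cs="Pánské sportovní boty Big Star...",
    target_pl="Męskie buty sportowe Big Star...",
    image_path="images/big-star-shoes.jpg",
    category_path=["Fashion", "Shoes", "Men's", "Sports"],
)
```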
Let’s look at the statistics of this dataset to understand the scale of the work.

As shown in Table 1 above, the dataset is split into different content types. Notice the distinction between “Offer titles,” “Product descriptions,” and “Product names.”
- Offer Titles are often “clickbaity” and marketing-heavy.
- Product Names are concise and factual.
- Product Descriptions are longer, full sentences.
The variety here is crucial. A model that translates a factual product name well might fail completely when trying to translate a persuasive, punchy offer title. By segmenting the data this way, the researchers can see exactly where context helps the most.
The Core Method: Three Ways to Inject Context
Once the dataset was ready, the researchers set out to answer a specific question: What is the best way to give context to a translation model?
They designed three distinct experimental approaches, ranging from advanced vision models to clever text-based hacks.

Figure 1 provides a high-level overview of these three strategies. Let’s break them down step-by-step.
Method 1: The Vision-Language Model (VLM)
Illustrated in Figure 1, top path (1).
This is the most “true” form of Multimodal Machine Translation. The researchers used a model called PaliGemma, a Vision-Language Model (VLM).
In a traditional text translation model, you input text, and you get text out. In a VLM, the input is a combination of Text + Image.
- Input: The model receives the Czech product text (e.g., “Durable Animal Pen…”) and the actual pixel data of the product image (the rabbits in the cage).
- Processing: The model processes both inputs simultaneously. It uses the visual features of the image to resolve ambiguities in the text.
- Output: It generates the Polish translation.
To check that the model was actually using the image (and not just ignoring it), the researchers ran a control experiment: they tested the model with the correct product image, and then again with a “black image” (a blank, black square). If the model with the real image performs better than the one with the black square, the visual data is providing real value.
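Below is a minimal sketch of what this setup could look like with the Hugging Face transformers API. The checkpoint name, prompt wording, and example strings are assumptions for illustration; the paper fine-tunes its own PaliGemma model on ConECT rather than relying on an off-the-shelf prompt.

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Illustrative base checkpoint; the paper fine-tunes PaliGemma on ConECT data.
model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

def translate(source_cs: str, image: Image.Image) -> str:
    # The Czech source text and the product image are encoded together;
    # the prompt wording here is an assumption, not the paper's exact template.
    prompt = f"Translate from Czech to Polish: {source_cs}"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    # Keep only the newly generated tokens (the Polish translation).
    generated = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)

# Real-image condition vs. the "black image" control described above.
real_img = Image.open("rabbit_pen.jpg").convert("RGB")
black_img = Image.new("RGB", (224, 224), (0, 0, 0))

print(translate("Odolný venkovní výběh pro králíky", real_img))
print(translate("Odolný venkovní výběh pro králíky", black_img))
```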
Method 2: Category Path Context
Illustrated in Figure 1, middle path (2).
Running large vision models is expensive and computationally heavy. The researchers wondered: Do we actually need the pixels, or just the information contained in the image?
Often, the “context” of an image is summarized in the store’s category path. Knowing an item is in “Pet Supplies” is just as useful as seeing a picture of a pet.
In this method, they stuck to a traditional text-only NMT model but used a clever formatting trick. They appended the category path to the beginning of the source sentence using special tokens.
- Original Input: Big Star men's sports shoes...
- Context-Aware Input: <SC> Fashion <SEP> Shoes <SEP> Men's <SEP> Sports <EC> Big Star men's sports shoes...
Here, <SC> stands for Start Category, <SEP> separates the sub-categories, and <EC> marks the End of the Category. This forces the translation model to “read” the category first, priming its internal state with the correct context before it attempts to translate the product name.
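As a rough sketch, this preprocessing step could be as simple as the helper below. Exactly how the special tokens are registered in the NMT model's vocabulary is a training detail the paper handles separately; the function only illustrates the string format.

```python
def add_category_context(source: str, category_path: list[str]) -> str:
    """Prepend the category breadcrumb to the source text using the special
    tokens described above: <SC> cat1 <SEP> cat2 ... <EC> source."""
    return f"<SC> {' <SEP> '.join(category_path)} <EC> {source}"

print(add_category_context(
    "Big Star men's sports shoes...",
    ["Fashion", "Shoes", "Men's", "Sports"],
))
# -> <SC> Fashion <SEP> Shoes <SEP> Men's <SEP> Sports <EC> Big Star men's sports shoes...
```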
Method 3: The Cascade Approach (Synthetic Image Descriptions)
Illustrated in Figure 1, bottom path (3).
This approach attempts to bridge the gap between the first two methods. What if you want to use the visual content of the image, but you want to use a standard text-based translation model?
The solution is a two-step “cascade”:
- Step 1 (VQA): Use a Visual Question Answering model to look at the image and generate a text description (a caption).
- Step 2 (NMT): Take that generated caption and append it to the source text, similar to how the category path was added in Method 2.
The researchers used prompts to generate these descriptions.

As shown in Table 3, they prompted the model to “describe the image in Czech.” The resulting description (e.g., “white and brown rabbit in a cage”) was then wrapped in <SD> (Start Description) and <ED> tags and fed into the translator.
The hypothesis was that this would provide the semantic richness of the image without requiring a massive VLM for the final translation step.
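In code, the cascade could be sketched roughly as follows. The captioning call is stubbed out and the example Czech strings are illustrative; only the <SD>…<ED> wrapping mirrors what the paper describes.

```python
def describe_image_cs(image_path: str) -> str:
    """Step 1 (VQA): ask a vision-language model to describe the image in Czech.
    Stubbed here; in the paper a VQA model is prompted to describe the image
    and its answer is used as the caption."""
    # e.g. caption = vqa_model.generate(image, prompt="Describe the image in Czech.")
    return "bílý a hnědý králík v kleci"  # "white and brown rabbit in a cage"

def add_image_description(source: str, description: str) -> str:
    """Step 2 (NMT): wrap the caption in <SD>...<ED> tags and prepend it,
    mirroring the category-path trick from Method 2."""
    return f"<SD> {description} <ED> {source}"

caption = describe_image_cs("rabbit_pen.jpg")
print(add_image_description("Odolný venkovní výběh pro králíky...", caption))
# -> <SD> bílý a hnědý králík v kleci <ED> Odolný venkovní výběh pro králíky...
```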
Experiments & Results: What Actually Worked?
The researchers evaluated their models using two standard metrics:
- chrF: A character-based metric that measures how much the character n-grams of the machine translation overlap with a professional human reference translation.
- COMET: A more advanced, neural-based metric that is trained to predict human quality judgments. It captures meaning better than simple word matching.
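For readers who want to reproduce this kind of evaluation, here is a minimal sketch using the sacrebleu and unbabel-comet Python packages. The toy sentences and the COMET checkpoint name are illustrative assumptions; the paper's exact evaluation setup may differ.

```python
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

sources    = ["Odolný venkovní výběh pro králíky"]          # toy Czech source
hypotheses = ["Wytrzymały wybieg zewnętrzny dla królików"]  # toy system output
references = ["Wytrzymały zewnętrzny wybieg dla królików"]  # toy human reference

# chrF: character n-gram F-score between hypothesis and reference.
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# COMET: a neural metric that also conditions on the source sentence.
# The checkpoint below is a public example, not necessarily the one used in the paper.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", comet_model.predict(data, batch_size=8, gpus=0).system_score)
```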
The results, presented in Table 2, offer some surprising insights.

Let’s analyze the winners and losers of these experiments.
1. Real Images vs. Black Images (VLM Success)
Look at the section “PaliGemma-3b” in the table. The researchers compared “real img” (the actual product photo) against “black img” (the blank square).
- Result: The model using real images consistently scored higher on both chrF and COMET metrics across almost all categories.
- Takeaway: The model is not hallucinating; it is actively using visual features to improve the translation. The “All sets” COMET score rose from 0.9095 (black image) to 0.9152 (real image). While this seems small numerically, in the world of high-precision MT, this is a solid confirmation that visual context resolves ambiguity.
2. The Power of Text Metadata (Category Paths)
Now, look at the “Category paths experiments.” Here, they compared a baseline model (no context) against the model with the category prefixes (<SC>...<EC>).
- Result: This method was extremely effective. The “Category context” model achieved a COMET score of 0.9362, beating the baseline of 0.9311.
- Takeaway: This is a crucial finding for the industry. While Vision-Language Models are exciting, simply feeding the category tree (which every e-commerce site already has) into a text-based model yields excellent results. It suggests that for e-commerce, the “context” we need is often categorical rather than visual.
3. The Failure of the Cascade (Image Descriptions)
Finally, look at the “Image desc. experiments.” This is the method where they generated a caption and added it to the text.
- Result: Performance dropped significantly. The COMET score plummeted to 0.8219 compared to the baseline of 0.9341.
- Takeaway: This is a classic example of “more data isn’t always better.” The researchers noted that this approach likely introduced noise. If the VQA model generates a slightly inaccurate description, or if the description focuses on irrelevant background details, it confuses the translation model rather than helping it. This phenomenon is known as “error propagation”—a mistake in step 1 (captioning) poisons step 2 (translation).
Conclusion and Implications
The ConECT paper provides a roadmap for the future of specialized translation. It moves us away from the idea that we should just throw more text data at a model and hope for the best.
Key Takeaways for Students:
- Context is Quantifiable: We can measure exactly how much an image or a category tag helps translation. The experiments proved that visual data does resolve ambiguity.
- Metadata is Gold: You don’t always need the most complex AI model. The “Category Path” experiment showed that leveraging existing structural data (like site categories) is a computationally cheap way to get state-of-the-art results.
- Beware of Cascades: Chaining models together (Model A describes image -> Model B translates text) is risky. If Model A hallucinates, Model B fails. End-to-end systems (like the VLM used in Method 1) are generally more robust because they learn to align the image and text directly.
By releasing the ConECT dataset, these researchers have opened the door for others to experiment with Czech-Polish translation. But more importantly, they have demonstrated that in the future of AI, understanding language will require looking beyond just the words on the page. Whether through pixels or category tags, context is the key to understanding.