Introduction: The Quest for a Universal Language of Vision
In the world of AI, Large Language Models (LLMs) like GPT-4 have become masters of generalization. A single model can write code, translate languages, and reason about complex topics. A key ingredient in this success is the humble tokenizer—a component that breaks down all forms of text (code, prose, tables) into a shared, unified set of tokens. This “universal language” allows models to scale efficiently and transfer knowledge seamlessly across tasks.
But what about vision? While AI can generate stunning images and understand complex scenes, the visual AI ecosystem remains fragmented. The models we use to generate images (like the VAE in Stable Diffusion) are fundamentally different from the ones we use to understand them (like CLIP). Moreover, most models are specialized for a single modality: an image model can’t process video, and a video model has no concept of 3D geometry. This fragmentation prevents the kind of cross-task generalization we see in LLMs.
What if we could create a single, unified “language” for all visual data? Researchers at Apple propose exactly this in their recent paper, introducing ATOKEN—the first visual tokenizer designed to unify not just different tasks (generation and understanding) but also different modalities: images, videos, and 3D assets.
Figure 1: ATOKEN provides a unified representation for images, videos, and 3D assets, enabling both high-fidelity reconstruction and strong semantic understanding from one model.
ATOKEN encodes these diverse inputs into a shared latent space—a “meeting ground” where pixels, motion, and geometry coexist. In this article, we’ll explore ATOKEN’s design, from its clever 4D representation to its innovative training strategy, and examine why it might be a crucial step toward the next generation of truly multimodal AI.
Background: The Fragmented World of Visual AI
To appreciate ATOKEN’s achievement, it helps to understand the landscape it aims to unify. Visual AI has long been split along two fault lines: task specialization and modality fragmentation.
The Great Divide: Reconstruction vs. Understanding
Visual tokenizers traditionally fall into one of two camps:
Reconstruction-Focused Tokenizers: Models like VAEs and VQ-GANs (used in generative systems) specialize in compression and reconstruction. They preserve low-level visual details like textures and colors but lack semantic understanding—you can recreate pixels precisely, but not extract meaning.
Understanding-Focused Encoders: Models like CLIP and SigLIP produce high-level semantic embeddings for pairing images with text. They excel at classification and retrieval but discard fine-grained pixel details, making original reconstruction impossible.
This divide means that a model that can describe an image cannot generate or edit it, while a model that can generate realistic images cannot deeply reason about them.
A Tower of Babel: Images, Videos, and 3D
The second fragmentation is modality. Images are 2D grids, videos are sequences of 2D grids over time, and 3D assets can be meshes, voxels, or Gaussian splats. Historically, each modality has had dedicated architectures—3D convolution for motion, graph networks for geometry—making it difficult to learn from all formats with one model.
Table 1: Comparison of existing visual tokenizers. Most specialize in one task and one modality; ATOKEN is the first to provide unified support across reconstruction, understanding, and all three modalities.
The Core Method: How ATOKEN Unifies Vision
ATOKEN’s solution rests on four pillars:
- Unified 4D representation
- Pure transformer architecture
- Adversarial-free training objective
- Progressive training curriculum
Figure 2: All visual inputs are converted into a sparse 4D representation, processed by a shared transformer encoder, then used for both reconstruction and semantic alignment.
1. A Unified 4D Latent Space
ATOKEN represents images, videos, and 3D assets as sparse points in a 4D coordinate space (t, x, y, z):
- Images: A single 2D slice with t = 0 and z = 0.
- Videos: Stacks of slices along the t axis, with z = 0.
- 3D Assets: Occupy volume in (x, y, z) space, with t = 0.
Instead of dense grids, ATOKEN uses sparse sets of feature–position pairs, z = {(z_i, p_i)}. The model processes only the active locations, making it efficient and flexible.
The same representation powers both reconstruction, via a decoder operating on the individual z_i vectors, and understanding, via pooled global embeddings for text alignment.
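To make the shared latent concrete, here is a minimal PyTorch sketch (not the authors' code) of how the three modalities can be packed into the same list of feature–position pairs; the shapes and helper names are illustrative assumptions.

```python
import torch

def image_to_sparse(feats_hw):
    """Pack a 2D patch-feature grid (H, W, C) as (feature, position) pairs with t = 0, z = 0."""
    H, W, C = feats_hw.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([torch.zeros_like(xs), xs, ys, torch.zeros_like(xs)], dim=-1)  # (t, x, y, z)
    return feats_hw.reshape(-1, C), pos.reshape(-1, 4)

def video_to_sparse(feats_thw):
    """Videos (T, H, W, C) stack slices along t, still with z = 0."""
    T, H, W, C = feats_thw.shape
    ts, ys, xs = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ts, xs, ys, torch.zeros_like(xs)], dim=-1)
    return feats_thw.reshape(-1, C), pos.reshape(-1, 4)

def voxels_to_sparse(voxel_feats, voxel_coords):
    """3D assets: features (N, C) at occupied (x, y, z) voxels only, with t = 0."""
    t = torch.zeros(len(voxel_coords), 1, dtype=voxel_coords.dtype)
    pos = torch.cat([t, voxel_coords], dim=-1)  # (t, x, y, z)
    return voxel_feats, pos

# All three modalities now share one interface: a set {(z_i, p_i)} that a single
# transformer can attend over, regardless of how dense or sparse the input is.
```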
2. A Pure Transformer Architecture
ATOKEN adapts a robust 2D encoder, SigLIP2, to process 4D data:
- Space-Time Patching: The input is split into t × p × p patches, enabling unified handling of images (t = 1) and videos.
- 4D Rotary Position Embeddings (RoPE): Give the model relative positioning along all four axes, allowing native flexibility in resolution and duration (sketched below).
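One simple way to realize 4D RoPE (a simplified sketch, not necessarily the paper's exact formulation) is to split each attention head's channels into four groups and apply standard 1D rotary embeddings per group, driven by the t, x, y, and z coordinates respectively:

```python
import torch

def rope_1d(x, positions, base=10000.0):
    """Standard rotary embedding over the last dim of x.
    x: (..., N, D) with D even; positions: (N,) integer coordinates."""
    D = x.shape[-1]
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)   # (D/2,)
    angles = positions.float()[:, None] * freqs[None, :]                # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_4d(x, pos):
    """Split channels into 4 groups, one per axis (t, x, y, z).
    x: (..., N, D) with D divisible by 8; pos: (N, 4)."""
    chunk = x.shape[-1] // 4
    outs = [rope_1d(x[..., i * chunk:(i + 1) * chunk], pos[:, i]) for i in range(4)]
    return torch.cat(outs, dim=-1)

# Because the encoding is relative along each axis, the same weights can handle
# different resolutions (x, y), durations (t), and voxel extents (z).
```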
For 3D assets, ATOKEN uses a multi-view rendering pipeline inspired by Trellis-SLAT: render views, tokenize them as images, and project features into a 3D voxel grid.
Figure 3: Multi-view renderings of 3D assets are tokenized, then aggregated into a 3D voxel-space representation.
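As a heavily simplified sketch of the aggregation step (not the Trellis-SLAT pipeline itself): assuming the mapping from each rendered patch to the voxel it hits has been precomputed from the known camera poses, multi-view features can be scatter-averaged into the voxel grid. The tensor shapes and the `patch_to_voxel` input are illustrative assumptions.

```python
import torch

def aggregate_views_to_voxels(view_feats, patch_to_voxel, num_voxels):
    """Average multi-view patch features into a shared voxel grid.
    view_feats:     (V, P, C)  patch features from V rendered views
    patch_to_voxel: (V, P)     precomputed voxel index hit by each patch (-1 = background)
    Returns (num_voxels, C) features for the occupied voxel grid."""
    V, P, C = view_feats.shape
    feats = view_feats.reshape(V * P, C)
    idx = patch_to_voxel.reshape(V * P)
    keep = idx >= 0
    feats, idx = feats[keep], idx[keep]
    acc = torch.zeros(num_voxels, C).index_add_(0, idx, feats)
    cnt = torch.zeros(num_voxels).index_add_(0, idx, torch.ones(len(idx)))
    return acc / cnt.clamp(min=1).unsqueeze(-1)
```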
3. Stable Training Without the Adversary
Training transformer autoencoders with GAN losses often leads to instability—the discriminator can overpower the generator. ATOKEN's analysis showed that roughly 87% of the reconstruction error stemmed from the covariance component (style and texture) and only about 13% from the mean (structure).
Figure 4: GAN training is unstable for ATOKEN’s transformer autoencoders. Gram matrix loss stabilizes training by directly optimizing texture/style statistics.
The solution: replace adversarial training with Gram matrix loss to optimize feature correlations, combined with perceptual losses (L1, LPIPS, CLIP) for pixel accuracy and semantic fidelity.
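The Gram matrix loss, familiar from neural style transfer, matches second-order (channel-correlation) statistics of feature maps between the reconstruction and the target. A minimal PyTorch sketch, assuming features come from some frozen perceptual network (the paper's exact feature extractor may differ):

```python
import torch

def gram_matrix(feat):
    """feat: (B, C, H, W) feature map -> (B, C, C) channel correlation matrix."""
    B, C, H, W = feat.shape
    f = feat.reshape(B, C, H * W)
    return f @ f.transpose(1, 2) / (C * H * W)

def gram_loss(feats_recon, feats_target):
    """Match texture/style statistics between reconstruction and target.
    Both args are lists of (B, C, H, W) feature maps from a frozen network."""
    return sum(
        torch.mean((gram_matrix(fr) - gram_matrix(ft)) ** 2)
        for fr, ft in zip(feats_recon, feats_target)
    )

# Combined with L1 / LPIPS / CLIP perceptual terms, this stands in for the GAN
# discriminator: texture statistics are optimized directly and stably.
```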
4. A Progressive Training Curriculum
Training across modalities and tasks requires balance. ATOKEN uses a four-stage curriculum:
Figure 5: Stages add capabilities incrementally—image, video, 3D, and optional discrete tokenization.
- Image Foundation: Train with image reconstruction only.
- Video Dynamics: Add video reconstruction/understanding.
- 3D Geometry: Add 3D assets—joint optimization across all.
- Discrete Tokenization (optional): Quantize latents for compatibility with autoregressive models.
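One way to express such a curriculum is as a simple stage schedule. The sketch below follows the article's staging, but the data-mixture ratios are placeholders rather than the paper's values.

```python
# Hypothetical stage schedule for progressive multimodal training.
# Ordering follows the article; mixture ratios are illustrative placeholders.
STAGES = [
    {"name": "1_image_foundation", "modalities": {"image": 1.0}},
    {"name": "2_video_dynamics",   "modalities": {"image": 0.5, "video": 0.5}},
    {"name": "3_3d_geometry",      "modalities": {"image": 0.4, "video": 0.4, "3d": 0.2}},
    {"name": "4_discrete_tokens",  "modalities": {"image": 0.4, "video": 0.4, "3d": 0.2},
     "quantize_latents": True},    # optional: for autoregressive compatibility
]

def run_curriculum(train_one_stage):
    """train_one_stage: any callable that trains the tokenizer on one stage's mix."""
    for stage in STAGES:
        train_one_stage(stage)  # earlier modalities stay in the mix, so later
                                # stages refine capabilities instead of overwriting them
```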
Key finding: multimodal training improved single-modality performance—image reconstruction got better after incorporating video and 3D.
Experiments and Results
A Unity of Modalities
Table 3: ATOKEN uniquely handles reconstruction and understanding for images, videos, and 3D.
On images, ATOKEN achieves 0.21 rFID for reconstruction and 82.2% zero-shot ImageNet accuracy; on video and 3D, its results are comparable or superior to specialized models.
Image Tokenization
Table 4: Multimodal training improves image reconstruction rFID from 0.258 (Stage 1) to 0.209 (Stage 3); lower is better.
Figure 9: ATOKEN preserves fine textures and readable text better than competitors, even at higher compression ratios.
Table 5: ATOKEN maintains competitive semantic understanding (82.2% vs SigLIP2’s 83.4%).
Video and 3D Tokenization
Figure 10: ATOKEN matches specialized video models, ensuring temporal consistency.
Figure 11: Unified training transfers color consistency from images/videos to 3D assets.
Downstream Applications
Multimodal LLMs
Replacing SlowFast-LLaVA’s vision encoder with ATOKEN yields competitive or better performance versus specialized encoders.
Table 9: ATOKEN-powered multimodal LLM shows strong performance on vision-language tasks across model sizes.
Generative Models
- Image Generation (Continuous Tokens): ATOKEN with Lightning-DiT achieves 1.56 gFID, close to specialized tokenizers.
Figure 12: ImageNet generation samples using ATOKEN’s continuous tokens.
- Image Generation (Discrete Tokens): ATOKEN with TokenBridge autoregressive model achieves 2.23 gFID, outperforming other unified tokenizers.
Figure 13: ImageNet generation samples using ATOKEN’s discrete tokens.
- Text-to-Video Generation: In controlled comparisons, ATOKEN matches specialized video tokenizers on T2V benchmarks.
- Image-to-3D Synthesis: ATOKEN tokens support image-conditioned 3D generation.
Figure 14: Image-to-3D generation outputs using ATOKEN discrete tokens.
Conclusion: A Universal Visual Language Is Within Reach
ATOKEN represents a breakthrough in unified visual representation. By combining:
- Sparse 4D latent space
- Flexible transformer architecture
- Stable adversarial-free training
- Progressive multimodal curriculum
…it achieves both high-fidelity reconstruction and strong semantic understanding across images, videos, and 3D.
The crucial insight: unification does not require sacrificing performance. Training across modalities can produce synergistic gains—learning temporal dynamics and 3D geometry enhances image understanding and reconstruction.
Much like BPE tokenization catalyzed LLM generalization, unified visual tokenizers like ATOKEN could become the foundation for “omnimodels” that seamlessly perceive, reason about, and generate across the full visual spectrum—bringing visual AI closer to the generalized versatility we see in language models today.