Introduction
In the world of Computer Vision, things are surprisingly orderly. Whether you are training a model on ImageNet or your own collection of vacation photos, the data usually looks the same: standard RGB images, captured by standard cameras, often resized to a standard resolution (like \(224 \times 224\)). This uniformity has allowed models like ResNet and Vision Transformers (ViTs) to become powerful, general-purpose engines.
But if you look at the planet from space, that order collapses into chaos.
Earth Observation (EO) data is notoriously messy. Satellites orbit at different altitudes, carrying sensors that see the world in vastly different ways. You might have an optical image from Sentinel-2 with 10-meter resolution, a radar scan from Sentinel-1 (which sees through clouds but looks like static noise to the human eye), and a commercial drone image with 0.2-meter resolution—all capturing the same forest.
Traditionally, building AI for this domain meant building silos. You trained one model for Sentinel-2 and a completely different one for aerial drones. If you wanted to combine them, you had to clumsily resize images, losing critical details or introducing artifacts.
Enter AnySat.

AnySat is a new foundation model architecture designed to break these silos. As shown in the figure above, it proposes a radical shift: a single model capable of digesting highly heterogeneous data—varying resolutions, scales, and modalities—simultaneously. Whether it’s tracking deforestation using radar or classifying crops using high-resolution optical imagery, AnySat learns from all of them at once to build a unified understanding of the Earth.
In this post, we will deconstruct how AnySat manages to be a “universal translator” for geospatial data, diving deep into its scale-adaptive architecture and its unique self-supervised training method.
The Challenge: A Tower of Babel in Space
To understand why AnySat is significant, we first need to appreciate the difficulty of the problem. In standard Deep Learning, we usually assume a fixed input tensor size. A standard Vision Transformer (ViT), for example, splits an image into fixed patches (e.g., \(16 \times 16\) pixels).
In Earth Observation, a \(16 \times 16\) pixel patch means something completely different depending on the sensor:
- Sentinel-2: A patch covers \(160 \times 160\) meters.
- Aerial Drone: A patch covers \(3.2 \times 3.2\) meters.
- MODIS Satellite: A patch covers nearly \(4 \times 4\) kilometers.
If you feed these directly into a standard model, the model has no concept of physical scale. A house in a drone image looks like a few pixels, while in a satellite image, an entire city block might be a single pixel. Previous approaches tried to solve this by rescaling everything to a common resolution, but this is computationally wasteful and degrades data quality.
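To make the scale mismatch concrete, here is a quick Python sketch that computes the ground footprint of a \(16 \times 16\) pixel patch for the nominal resolutions quoted above:

```python
# Ground footprint of a fixed 16x16-pixel patch for sensors with
# different ground sample distances (GSD, in meters per pixel).
PATCH_PIXELS = 16

sensors = {"Sentinel-2": 10.0, "Aerial drone": 0.2, "MODIS": 250.0}  # m/pixel

for name, gsd in sensors.items():
    footprint_m = PATCH_PIXELS * gsd  # side length of the patch on the ground
    print(f"{name}: {footprint_m:g} x {footprint_m:g} m")
# Sentinel-2: 160 x 160 m, Aerial drone: 3.2 x 3.2 m, MODIS: 4000 x 4000 m
```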
Furthermore, the modalities (the types of sensor data) are fundamentally different. Optical cameras capture reflected sunlight (chemistry), while Synthetic Aperture Radar (SAR) captures surface texture and moisture (physics). A foundation model needs to understand that a bright green pixel in an optical image and a high-backscatter pixel in a radar image might represent the same cornfield.
The AnySat Solution
AnySat tackles these challenges through two main innovations:
- Scale-Adaptive Patch Encoding: A way to ingest data of any resolution without resizing the original image.
- Multimodal JEPA: A self-supervised learning strategy that teaches the model to understand the semantics of the data by predicting missing information in feature space, rather than pixel space.
1. Scale-Adaptive Patch Encoding
The first hurdle is getting the data into the network. AnySat abandons the idea that an input image must have a fixed size. Instead, it processes tiles (large geographic areas) partitioned into patches.
However, because the resolution (meters per pixel) varies by sensor, the number of pixels inside a physical patch varies wildly. AnySat solves this with a hierarchical encoding scheme.

As illustrated in Figure 2, the process works like this:
- Physical Consistency: The model defines a patch size in meters (e.g., \(P \times P\) meters), not pixels. This ensures that the model always “thinks” in physical units.
- Sub-Patching: Because pixel density varies, a patch from a high-resolution sensor contains many pixels, while one from a low-resolution sensor contains only a few. AnySat splits each patch into sub-patches of fixed pixel size (\(\delta_m\)).
- Projection: Each sub-patch is processed by a specific Multi-Layer Perceptron (MLP) tailored to that sensor (the projector \(\phi^{\text{proj}}\)).
- Merging: A shared transformer (\(\phi^{\text{trans}}\)) aggregates all these sub-patches into a single vector representation for the whole patch.
The beauty of this approach is that the shared transformer doesn’t care how many sub-patches there are. Whether the input was a dense drone image (many sub-patches) or a coarse satellite image (few sub-patches), the output is a single, standardized embedding vector of size \(E\).
Mathematically, the encoding of a patch \(x_p^m\) (modality \(m\) at patch \(p\)) can be summarized, following the description above, as:

\[
e_p^m = \phi^{\text{trans}}\!\left(\left\{\phi_m^{\text{proj}}\!\left(x_{p,i}^m\right)\right\}_{i=1}^{S_m}\right),
\]

where \(x_{p,i}^m\) is the \(i\)-th sub-patch of the patch and \(S_m\) is the number of sub-patches for modality \(m\).
This architecture allows AnySat to handle inputs ranging from 0.2 meters/pixel to 250 meters/pixel using the same core network weights.
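Below is a minimal PyTorch sketch of this design. It is not the authors' implementation: the module sizes, the two-layer projector, and the mean pooling are assumptions, but it shows how a per-modality projector (\(\phi^{\text{proj}}\)) plus a shared transformer (\(\phi^{\text{trans}}\)) can turn a variable number of sub-patches into one fixed-size embedding.

```python
import torch
import torch.nn as nn


class ScaleAdaptiveEncoder(nn.Module):
    """Sketch of a scale-adaptive patch encoder: per-modality MLP projectors
    feed a shared transformer that pools any number of sub-patches into a
    single embedding of size E (hyperparameters here are illustrative)."""

    def __init__(self, sub_patch_dims: dict, embed_dim: int = 256):
        super().__init__()
        # One projector phi^proj per modality (input dim = flattened sub-patch).
        self.projectors = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, embed_dim), nn.GELU(),
                             nn.Linear(embed_dim, embed_dim))
            for m, d in sub_patch_dims.items()
        })
        # Shared transformer phi^trans, reused for every modality.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, sub_patches: torch.Tensor, modality: str) -> torch.Tensor:
        # sub_patches: (batch, n_sub_patches, sub_patch_dim); n varies by sensor.
        tokens = self.projectors[modality](sub_patches)
        tokens = self.transformer(tokens)
        return tokens.mean(dim=1)  # one vector per patch, regardless of n


# A dense drone patch (100 sub-patches) and a coarse satellite patch
# (a single sub-patch) both map to the same embedding size.
enc = ScaleAdaptiveEncoder({"drone_rgb": 10 * 10 * 3, "s2": 1 * 1 * 10})
dense = enc(torch.randn(4, 100, 300), "drone_rgb")   # (4, 256)
coarse = enc(torch.randn(4, 1, 10), "s2")            # (4, 256)
```

The key property is that the forward pass never depends on the sensor's pixel grid: whatever the number of sub-patches, the output shape is the same.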
2. The Architecture of AnySat
Once the data is encoded into these standardized vectors, AnySat employs a Student-Teacher architecture to learn. The core idea is to train the model without human labels (Self-Supervised Learning) by playing a sophisticated game of “fill in the blanks.”

Figure 3 outlines the workflow. The architecture consists of:
- The Student: It sees a corrupted version of the data. Some patches are completely removed (“dropped”), and others have specific sensor modalities hidden (“masked”). Its job is to guess what’s missing.
- The Teacher: It sees the full, uncorrupted data. Its job is to provide the “correct answer.”
Crucially, the Teacher is not trained via backpropagation. Its weights are an Exponential Moving Average (EMA) of the Student’s weights. This ensures the Teacher is always slightly more stable and “mature” than the Student, providing a steady target for learning.
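A minimal sketch of the EMA update, called after every optimizer step on the Student (the momentum value `tau` below is an illustrative choice, not a value taken from the paper):

```python
import torch


@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   tau: float = 0.996) -> None:
    """Exponential moving average: the teacher slowly tracks the student
    and is never updated by backpropagation."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(tau).add_(s_param, alpha=1.0 - tau)
```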
The model uses a Modality Combiner (\(\phi^{\text{comb}}\)) to merge information from different sensors (like optical and radar) covering the same patch. This results in a multimodal representation \(f^*\) used for the final prediction.
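As an illustration only (the design below is assumed, not lifted from the paper), a modality combiner can be sketched as a small transformer that attends over the per-modality embeddings of a patch and summarizes them with a learned query token:

```python
import torch
import torch.nn as nn


class ModalityCombiner(nn.Module):
    """Illustrative combiner: attends over the per-modality embeddings of a
    patch and summarizes them with a learned query token (design assumed)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, modality_tokens: torch.Tensor) -> torch.Tensor:
        # modality_tokens: (batch * patches, n_modalities, embed_dim);
        # n_modalities can vary between tiles (e.g. optical only vs. optical + SAR).
        q = self.query.expand(modality_tokens.size(0), -1, -1)
        fused = self.fuse(torch.cat([q, modality_tokens], dim=1))
        return fused[:, 0]  # the query position holds the multimodal f*
```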

3. Training: Joint Embedding Predictive Architecture (JEPA)
Most self-supervised models in computer vision (like MAE) try to reconstruct the missing pixels. If you mask out a car, the model tries to draw the car pixel-by-pixel.
In Earth Observation, pixel reconstruction is dangerous. Two satellite images of the same forest taken one day apart might look totally different due to cloud shadows, sun angles, or atmospheric haze. If the model tries to reconstruct the exact pixels, it wastes capacity learning about clouds and shadows rather than the forest itself.
AnySat adopts the JEPA framework. Instead of predicting pixels, the Student tries to predict the feature embeddings produced by the Teacher.

The loss measures the distance between the Student's predicted embedding and the Teacher's actual embedding for each dropped patch in the set \(K\). Writing the Student's prediction as \(\hat{f}_p\) and the Teacher's target as \(\bar{f}_p\), it takes the form:

\[
\mathcal{L}_{\text{pred}} = \frac{1}{|K|} \sum_{p \in K} d\!\left(\hat{f}_p, \bar{f}_p\right),
\]

where \(d\) is a distance in embedding space (e.g., a squared \(\ell_2\) distance).
By predicting in “latent space” (feature space), the model learns semantic consistency. It learns that “forest” is “forest,” regardless of whether a cloud shadow is passing over it in the specific image it’s trying to reconstruct.
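In code, the latent-space objective reduces to a distance between predicted and target features on the dropped patches. A minimal sketch, assuming a mean-squared-error distance (the paper's exact distance may differ):

```python
import torch
import torch.nn.functional as F


def jepa_loss(predicted: torch.Tensor, target: torch.Tensor,
              dropped_mask: torch.Tensor) -> torch.Tensor:
    """Latent prediction loss on dropped patches only.

    predicted:    (batch, patches, dim) student predictions.
    target:       (batch, patches, dim) teacher embeddings (no gradient).
    dropped_mask: (batch, patches) bool, True where the patch was dropped.
    """
    diff = F.mse_loss(predicted, target.detach(), reduction="none").mean(dim=-1)
    return (diff * dropped_mask).sum() / dropped_mask.sum().clamp(min=1)
```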
4. Aligning the Modalities
Simply reconstructing missing parts isn’t enough when dealing with multi-sensor data. We need to ensure that the model understands that a radar image of a building corresponds to the optical image of that same building.
To enforce this, AnySat adds a Contrastive Loss. This loss function forces the embeddings of the same patch from different modalities (e.g., \(f^{optical}\) and \(f^{radar}\)) to be close to each other in vector space, while pushing apart embeddings from different patches.

This alignment is critical. It allows the model to transfer knowledge. If AnySat learns to identify a specific crop type using optical data, the contrastive loss helps it map that knowledge to radar patterns, improving performance even when optical data is missing (e.g., at night or during cloudy weather).
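A hedged sketch of such a cross-modal term, in the spirit of InfoNCE (the temperature, normalization, and symmetric form below are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F


def cross_modal_contrastive(f_optical: torch.Tensor, f_radar: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: matching patches across modalities are positives,
    all other patches in the batch are negatives."""
    a = F.normalize(f_optical, dim=-1)   # (n_patches, dim)
    b = F.normalize(f_radar, dim=-1)
    logits = a @ b.t() / temperature     # similarity of every optical/radar pair
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric: optical -> radar and radar -> optical.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```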
GeoPlex: A Dataset to Rule Them All
A versatile model requires versatile training data. The researchers compiled GeoPlex, a massive collection of 5 diverse multimodal datasets.

As shown in Figure 4, GeoPlex isn’t just large; it’s geographically and spectrally diverse. It covers:
- Resolutions: From 0.2m (aerial) to 250m (MODIS).
- Sensors: 11 distinct modalities, including Sentinel-1/2, Landsat, SPOT, and NAIP.
- Types: Single images and time-series data.

Table C details the sheer variety of data AnySat ingests. This diversity prevents the model from overfitting to the specific quirks of one sensor or one region (like the “greenness” of European forests vs. the Amazon).
For scale awareness, the model uses a specialized positional encoding that accounts for the physical size of each patch, so the network knows both where a patch sits within the tile and how large an area it covers.
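The exact formulation is not reproduced here, but the intuition can be sketched as a sinusoidal encoding evaluated on physical coordinates (patch centers in meters) rather than on array indices, so that geometry stays consistent across resolutions. Everything below, from the frequency band to the dimensionality, is an illustrative assumption:

```python
import torch


def physical_positional_encoding(centers_m: torch.Tensor,
                                 dim: int = 128) -> torch.Tensor:
    """Sinusoidal encoding of patch centers given in meters.

    centers_m: (n_patches, 2) x/y offsets of patch centers within the tile, in meters.
    Returns:   (n_patches, 2 * dim) codes; patches that are physically far apart
               get distinct codes regardless of the sensor's pixel grid.
    """
    freqs = torch.exp(torch.linspace(0, -8, dim // 2))     # assumed frequency band
    angles = centers_m[:, :, None] * freqs                  # (n, 2, dim // 2)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (n, 2, dim)
    return enc.flatten(1)                                    # (n, 2 * dim)
```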

Experimental Results
So, does this “Universal Translator” actually work? The researchers put AnySat to the test across 9 diverse downstream tasks, ranging from crop classification to flood segmentation.
Performance on GeoPlex
First, they evaluated the model on the test sets of the datasets included in GeoPlex. The results were impressive.

Figure 5 shows that AnySat (purple bars) consistently outperforms or matches state-of-the-art (SOTA) specialized models.
- Tree Species Classification: AnySat outperforms specialized models like OmniSat and DOFA.
- Semantic Segmentation: In tasks like identifying land cover (PASTIS, FLAIR), AnySat achieves superior Mean Intersection over Union (mIoU) scores.
This is notable because usually, “generalist” models trade off peak performance for versatility. AnySat seems to enjoy the best of both worlds: the robustness of a large foundation model and the precision of a specialist.
Generalization to External Datasets
The true test of a foundation model is how it handles data it has never seen before. The researchers tested AnySat on 6 external datasets that were not part of the training mix (GeoPlex).
The results in Table 1 are striking.

AnySat achieves state-of-the-art results on datasets like Sen1Floods11 (flood detection) and HLS Burn Scar, often outperforming models that were significantly larger or specifically trained on similar data (like Prithvi or SatMAE).
What makes this even more impressive is that AnySat can handle unseen sensor configurations. For example, the TimeSen2Crop dataset contains single-pixel time series (no spatial context), a format not explicitly present in GeoPlex. AnySat adapted seamlessly.
Visualizing the Capabilities
Numbers tell one story, but visual outputs tell another. Figure B demonstrates the semantic segmentation capabilities of the model.

The predictions (middle row) closely align with the ground truth (bottom row), even for complex shapes like agricultural parcels or floodwaters. The model handles the transition between the high-resolution texture of aerial imagery and the coarser blocks of Sentinel data without producing the “checkerboard” artifacts common in resized data.
Why does it work? (Ablation Studies)
The researchers performed ablation studies to understand which components drive this performance.

Table 2 highlights two key findings:
- Contrastive Loss Matters: Removing the contrastive loss (which aligns optical and radar data) dropped classification performance significantly. This confirms that learning the relationship between modalities is crucial for semantic understanding.
- JEPA vs. Random Dropping: The structured masking strategy of JEPA proved slightly better for segmentation tasks compared to random token dropping, reinforcing the value of the predictive architecture.
Summary of Performance
To visualize the dominance of AnySat, the researchers plotted its performance against SOTA methods on a radar chart.

As seen in Figure A, the purple line (AnySat) encompasses the orange line (SOTA) on almost every axis. Whether it’s fine-tuning (FT) or Linear Probing (LP)—where only the last layer is trained—AnySat provides a superior starting point for EO tasks.
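For readers new to the terminology: linear probing freezes the pretrained encoder and trains only a linear head on top, which is why it is such a telling measure of the quality of the learned features. A minimal sketch (the encoder, embedding size, and class count are placeholders):

```python
import torch.nn as nn


def build_linear_probe(encoder: nn.Module, embed_dim: int,
                       num_classes: int) -> nn.Module:
    """Freeze the pretrained encoder; only the final linear head is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(embed_dim, num_classes))
```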
Conclusion
AnySat represents a significant maturation in Earth Observation AI. It moves us away from the era of “one sensor, one model” toward a future where we can treat satellite data as a unified, continuous stream of information.
By combining Scale-Adaptive Encoders with the semantic power of JEPA and Contrastive Learning, AnySat solves the fundamental headache of EO data: heterogeneity. It allows researchers to train on whatever data they have—drones, old satellites, new radar constellations—and produce a model that is greater than the sum of its parts.
For students and researchers entering the field, AnySat demonstrates that the future of Remote Sensing isn’t just about launching better satellites; it’s about building smarter architectures that can finally make sense of the messy, multimodal picture of our changing planet.