[ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points 🔗](https://arxiv.org/abs/2503.02745)

From Chaos to Code: Transforming Sparse Point Clouds into Structured 3D Buildings with ArcPro

Imagine flying a drone over a city to map it. The drone captures thousands of images, and through photogrammetry, you generate a 3D representation of the scene. What you get back, however, is rarely a pristine, CAD-ready model. Instead, you get a “point cloud”—a chaotic swarm of millions of floating dots. If the scan is high-quality, the dots are dense, and you can see the surfaces clearly. But in the real world, data is often messy. Aerial scans can be sparse (containing very few points), noisy (points are in the wrong place), or incomplete (entire walls might be missing due to occlusion). ...

2025-03 · 10 min · 2074 words
[AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities 🔗](https://arxiv.org/abs/2412.14123)

AnySat: The Universal Translator for Satellite Imagery

Introduction In the world of Computer Vision, things are surprisingly orderly. Whether you are training a model on ImageNet or your own collection of vacation photos, the data usually looks the same: standard RGB images, captured by standard cameras, often resized to a standard resolution (like \(224 \times 224\)). This uniformity has allowed models like ResNet and Vision Transformers (ViTs) to become powerful, general-purpose engines. But if you look at the planet from space, that order collapses into chaos. ...

2024-12 · 10 min · 1989 words
[Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Kumari_Annotation_Ambiguity_Aware_Semi-Supervised_Medical_Image_Segmentation_CVPR_2025_paper.pdf)

Embracing Uncertainty — How AmbiSSL Revolutionizes Medical Image Segmentation

Introduction In the world of medical diagnostics, there is rarely a single, indisputable truth. When three different radiologists look at a CT scan of a lung nodule or an MRI of a tumor, they will likely draw three slightly different boundaries around the lesion. This isn’t an error; it is the inherent ambiguity of medical imaging caused by blurred edges, low contrast, and complex anatomy. However, traditional Deep Learning models treat segmentation as a deterministic task. They are trained to output a single “correct” mask. This creates a disconnect between AI outputs and clinical reality. Furthermore, training these models requires massive datasets with pixel-perfect annotations, which are incredibly expensive and time-consuming to obtain. ...

9 min · 1724 words
[All-directional Disparity Estimation for Real-world QPD Images 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_All-directional_Disparity_Estimation_for_Real-world_QPD_Images_CVPR_2025_paper.pdf)

Unlocking Depth in Smartphone Cameras: Deep Learning for Quad Photodiode Sensors

If you have bought a high-end smartphone in the last few years, you have likely benefited from the rapid evolution of image sensors. The quest for instantaneous autofocus has driven hardware engineers to move from standard sensors to Dual-Pixel (DP) sensors, and more recently, to Quad Photodiode (QPD) sensors. While QPD sensors are designed primarily to make autofocus lightning-fast, they hide a secondary capability: depth estimation. Just as our two eyes allow us to perceive depth through stereo vision, the sub-pixels in a QPD sensor can theoretically function as tiny, multi-view cameras. However, extracting accurate depth (or disparity) from these sensors is notoriously difficult due to physical limitations like uneven lighting and microscopic distances between pixels. ...

8 min · 1603 words
[All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhou_All-Optical_Nonlinear_Diffractive_Deep_Network_for_Ultrafast_Image_Denoising_CVPR_2025_paper.pdf)

Denoising at the Speed of Light—How N3DNet Revolutionizes Optical Computing

Introduction In the world of computer vision and signal processing, noise is the enemy. Whether it’s grainy low-light photographs, medical imaging artifacts, or signal degradation in fiber optic cables, “denoising” is a fundamental step in making data usable. Traditionally, we rely on electronic chips (CPUs and GPUs) to clean up these images. We run heavy algorithms—from classical Wiener filtering to modern Convolutional Neural Networks (CNNs)—to guess what the clean image should look like. While effective, this approach hits a hard wall: latency and power consumption. Electronic computing involves moving electrons through transistors, which generates heat and takes time. When you need to process data in real-time, such as in high-speed fiber optic communications, electronic chips often become the bottleneck. ...

8 min · 1561 words
[All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages 🔗](https://arxiv.org/abs/2411.16508)

Beyond English: Why AI Needs to Understand the World's 100 Languages (ALM-bench)

Introduction Imagine showing an AI a photo of a bustling street festival. If the festival is Mardi Gras in New Orleans, most top-tier AI models will instantly recognize the beads, the floats, and the context. But what if that photo depicts Mela Chiraghan in Pakistan or a traditional Angampora martial arts display in Sri Lanka? This is where the cracks in modern Artificial Intelligence begin to show. While Large Multimodal Models (LMMs)—systems that can see images and process text simultaneously—have made incredible leaps in capability, they possess a significant blind spot: the majority of the world’s cultures and languages. ...

2024-11 · 7 min · 1336 words
[Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging 🔗](https://arxiv.org/abs/2505.10649)

Why AI Forgets: Solving Catastrophic Forgetting in Medical Imaging

Artificial Intelligence has made massive strides in medical diagnostics, particularly in the analysis of pathology slides. However, there is a hidden problem in the deployment of these systems: they are static. In the fast-moving world of medicine, new diseases are discovered, new subtypes are classified, and scanning equipment is upgraded. Ideally, we want an AI model that learns continuously, adapting to new data without losing its ability to recognize previous conditions. This is the realm of Continual Learning (CL). But when researchers apply standard CL techniques to pathology, they run into a wall known as catastrophic forgetting. The model learns the new task but completely forgets the old one. ...

2025-05 · 9 min · 1718 words
[AdaCM2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction 🔗](https://arxiv.org/abs/2411.12593)

Breaking the Memory Wall: How AdaCM² Enables AI to Watch and Understand Full-Length Movies

Imagine asking an AI to watch a two-hour movie and then asking, “What was the number on the jersey of the man in the background at the very end?” or “How did the protagonist’s relationship with her sister evolve from the first scene to the last?” For most current multimodal AI models, this is an impossible task. While models like GPT-4V or VideoLLaMA are impressive at analyzing short clips (typically 5 to 15 seconds), they hit a hard limit when the video stretches into minutes or hours. This limit is known as the Memory Wall. As a video gets longer, the number of visual “tokens” (pieces of information) the model must hold in its memory grows massively. Eventually, the GPU runs out of memory (OOM), or the model gets overwhelmed by noise and forgets the context. ...

2024-11 · 9 min · 1711 words
[Active Hyperspectral Imaging Using an Event Camera 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_Active_Hyperspectral_Imaging_Using_an_Event_Camera_CVPR_2025_paper.pdf)

Breaking the Speed Limit of Color - How Event Cameras Are Revolutionizing Hyperspectral Imaging

Introduction: The Invisible World and the Iron Triangle Human vision is trichromatic; we perceive the world through a mix of red, green, and blue. However, the physical world is far richer. Every material interacts with light across a continuous spectrum of wavelengths, creating a unique “fingerprint” invisible to the naked eye. Hyperspectral Imaging (HSI) is the technology that allows us to see these fingerprints. By capturing hundreds of spectral bands instead of just three, HSI can distinguish between real and fake plants, detect diseases in tissue, or classify minerals in real-time. ...

9 min · 1796 words
[ARM: Appearance Reconstruction Model for Relightable 3D Generation 🔗](https://arxiv.org/abs/2411.10825)

Beyond Baked Lighting: How ARM Decouples Shape and Material for Relightable 3D

Introduction In the rapidly evolving world of Generative AI, creating a 3D object from a single 2D image is something of a “Holy Grail.” We have seen tremendous progress with models that can turn a picture of a cat into a 3D mesh in seconds. However, if you look closely at the results of most current state-of-the-art models, you will notice a flaw: they look great from the original camera angle, but they often fail to react realistically to light. ...

2024-11 · 9 min · 1707 words
[ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wei_ALIEN_Implicit_Neural_Representations_for_Human_Motion_Prediction_under_Arbitrary_CVPR_2025_paper.pdf)

Solving the Lag: How ALIEN Predicts Human Motion Despite Arbitrary Network Latency

Introduction Imagine you are playing a high-stakes match of virtual reality table tennis against a friend halfway across the world. You swing your controller, expecting your avatar to mirror the movement instantly. But there’s a catch: the internet connection fluctuates. Your swing data travels through a Wide Area Network (WAN), encountering unpredictable delays before reaching the game server or your opponent’s display. In the world of computer vision and robotics, this is known as the latency problem. Whether it is a surrogate robot replicating a human’s movements or a metaverse avatar interacting with a virtual environment, time delays caused by network transmission and algorithm execution are inevitable. ...

9 min · 1775 words
[AIpparel: A Multimodal Foundation Model for Digital Garments 🔗](https://arxiv.org/abs/2412.03937)

AIpparel: The First Foundation Model for Designing Digital Fashion

Fashion is an intrinsic part of human culture, serving as a shield against the elements and a canvas for self-expression. However, the backend of the fashion industry—specifically the creation of sewing patterns—remains a surprisingly manual and technical bottleneck. While generative AI has revolutionized 2D image creation (think Midjourney or DALL-E), generating manufacturable garments is a different beast entirely. A sewing pattern isn’t just a picture of a dress; it is a complex set of 2D panels with precise geometric relationships that must stitch together to form a 3D shape. To date, AI models for fashion have been “single-modal,” meaning they could perhaps turn a 3D scan into a pattern, or text into a pattern, but they lacked the flexibility to understand images, text, and geometry simultaneously. ...

2024-12 · 7 min · 1382 words
[A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_A_Unified_Approach_to_Interpreting_Self-supervised_Pre-training_Methods_for_3D_CVPR_2025_paper.pdf)

Why Does 3D Pre-training Work? Unlocking the Black Box with Game Theory

In the rapidly evolving world of 3D computer vision, self-supervised pre-training has become the gold standard. Whether you are building perception systems for autonomous vehicles or analyzing 3D medical scans, the recipe for success usually involves taking a massive, unlabeled dataset, pre-training a Deep Neural Network (DNN) on it, and then fine-tuning it for your specific task. We know that it works. Pre-training consistently boosts performance. But why does it work? ...

10 min · 2003 words
[4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion 🔗](https://arxiv.org/abs/2412.04462)

Beyond 2D: How 4Real-Video Generates Consistent 4D Worlds in Seconds

Imagine you are watching a video of a cat playing with a toy. In a standard video, you are a passive observer, locked into the camera angle the videographer chose. Now, imagine you could pause that video at any second, grab the screen, and rotate the camera around the frozen cat to see the toy from the back. Then, you press play, and the video continues from that new angle. ...

2024-12 · 8 min · 1646 words
[3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion 🔗](https://arxiv.org/abs/2409.12957)

3DTopia-XL: The Future of High-Fidelity 3D Asset Generation with Primitive Diffusion

The demand for high-quality 3D assets is exploding. From the immersive worlds of video games and virtual reality to the practical applications of architectural visualization and film production, the need for detailed, realistic 3D models is higher than ever. Traditionally, creating these assets has been a labor-intensive bottleneck, requiring skilled artists to sculpt geometry, paint textures, and tune material properties manually. In recent years, Generative AI has promised to automate this pipeline. We’ve seen models that can turn text into 3D shapes or turn a single image into a rotating mesh. However, a significant gap remains between what AI generates and what professional graphics engines actually need. Most current AI models produce “baked” assets—meshes with color painted directly onto the vertices. They often look like plastic toys or clay models, lacking the complex material properties (like how shiny metal is versus how matte rubber is) required for photorealistic rendering. ...

2024-09 · 8 min · 1577 words
[3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes 🔗](https://arxiv.org/abs/2411.14974)

Beyond Gaussians: Why 3D Smooth Convexes are the Future of Radiance Fields

Introduction In the rapidly evolving world of computer vision, the quest to reconstruct reality inside a computer has seen massive leaps in just a few years. We started with photogrammetry, moved to the revolutionary Neural Radiance Fields (NeRFs), and most recently arrived at 3D Gaussian Splatting (3DGS). 3DGS changed the game by allowing for real-time rendering and fast training speeds that NeRFs struggled to achieve. It represents a scene not as a continuous volume, but as millions of discrete 3D Gaussian “blobs.” While this works incredibly well for organic, fuzzy structures, it hits a wall when dealing with the man-made world. Look around you—walls, tables, screens, and buildings are defined by sharp edges and flat surfaces. Gaussians, by their nature, are soft, round, and diffuse. Trying to represent a sharp cube with round blobs is like trying to build a Lego house out of water balloons; you need an excessive amount of them to approximate the flat sides, and it’s still never quite perfect. ...

2024-11 · 10 min · 1998 words
[Zero-Shot Monocular Scene Flow Estimation in the Wild 🔗](https://arxiv.org/abs/2501.10357)

Taming the Wild - A New Standard for Zero-Shot Monocular Scene Flow

Introduction Imagine you are looking at a standard video clip. It’s a 2D sequence of images. Your brain, processing this monocular (single-eye) view, instantly understands two things: the 3D structure of the scene (what is close, what is far) and the motion of objects (where things are moving in that 3D space). For computer vision models, replicating this human intuition is an incredibly difficult task known as Monocular Scene Flow (MSF). While Artificial Intelligence has made massive leaps in static depth estimation and 2D optical flow, estimating dense 3D motion from a single camera remains an elusive frontier. ...

2025-01 · 8 min · 1609 words
[VGGT: Visual Geometry Grounded Transformer 🔗](https://arxiv.org/abs/2503.11651)

One Pass to Rule Them All: Understanding VGGT for Instant 3D Reconstruction

Introduction For decades, the field of computer vision has chased a specific “Holy Grail”: taking a handful of flat, 2D photos scattered around a scene and instantly transforming them into a coherent 3D model. Traditionally, this process—known as Structure-from-Motion (SfM)—has been a slow, mathematical grind. It involves detecting features, matching them across images, solving complex geometric equations to find camera positions, and then running iterative optimization algorithms like Bundle Adjustment to refine everything. While effective, it is computationally expensive and often brittle. ...

2025-03 · 9 min · 1886 words
[UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming 🔗](https://arxiv.org/abs/2307.16375)

Breaking the Distributed Bottleneck: How UniAP Unifies Parallel Training Strategies

If you have ever tried to train a massive Large Language Model (LLM) like Llama or a vision giant like ViT, you know the struggle: a single GPU simply doesn’t cut it. To train these behemoths, we need distributed learning across clusters of GPUs. But here is the catch: simply having a cluster isn’t enough. You have to decide how to split the model. Do you split the data? Do you split the layers? Do you split the tensors inside the layers? ...

2023-07 · 8 min · 1647 words
[The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition 🔗](https://arxiv.org/abs/2502.21201)

Can AI See the Chimp for the Trees? Mitigating Background Bias in Wildlife Monitoring

Introduction Imagine you are training a computer vision model to recognize a chimpanzee climbing a tree. You feed it thousands of hours of video footage. The model achieves high accuracy, and you are thrilled. But then, you test it on a video of an empty forest with no chimpanzee in sight, and the model confidently predicts: “Climbing.” Why does this happen? The model has fallen into a trap known as shortcut learning. Instead of learning the complex motion of the limbs or the texture of the fur, the model took the path of least resistance: it learned that “vertical tree trunks” usually equal “climbing.” It memorized the background, not the behavior. ...

2025-02 · 9 min · 1735 words