[CADDreamer: CAD Object Generation from Single-view Images 🔗](https://arxiv.org/abs/2502.20732)

From Pixels to Parts: How CADDreamer Generates Editable CAD Models from Single Images

In the last few years, generative AI has transformed how we visualize ideas. Tools like Midjourney or Stable Diffusion can conjure photorealistic scenes from a text prompt, and recent breakthroughs in 3D generation—like DreamFusion or Wonder3D—can turn a single 2D image into a rotating 3D asset. However, if you are an engineer, a product designer, or a game developer, you likely face a frustrating reality: generated 3D meshes are often useless for manufacturing. ...

2025-02 · 9 min · 1851 words
[Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Cheng_Breaking_the_Memory_Barrier_of_Contrastive_Loss_via_Tile-Based_Strategy_CVPR_2025_paper.pdf)

How to Train CLIP with Infinite Batch Sizes: Breaking the Memory Barrier

In the world of modern AI, specifically in Representation Learning, there is a recurring theme: bigger is usually better. This is particularly true for contrastive learning models like CLIP (Contrastive Language-Image Pre-training). The secret sauce behind these models isn’t just the architecture; it’s the data, and more importantly, how much data the model sees at once. Research has consistently shown that larger batch sizes lead to better performance. A larger batch provides a more diverse set of “negative” samples (images that don’t match the text), forcing the model to learn much sharper, more discriminative features. ...
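
Why does memory blow up? A minimal PyTorch sketch of the vanilla CLIP-style contrastive loss (the baseline the tile-based strategy replaces, not the paper's implementation) makes it visible: the loss materializes a full \(B \times B\) similarity matrix, so memory grows quadratically with batch size.

```python
# Minimal sketch of a vanilla CLIP-style contrastive (InfoNCE) loss.
# The (B, B) logits matrix below is the memory bottleneck: it grows
# quadratically with batch size, which the tile-based strategy avoids
# by never materializing it all at once.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """img_emb, txt_emb: (B, D) L2-normalized image/text embeddings."""
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; every off-diagonal entry is a negative.
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```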

10 min · 1961 words
[Boost Your Human Image Generation Model via Direct Preference Optimization 🔗](https://arxiv.org/abs/2405.20216)

Crossing the Uncanny Valley: How HG-DPO Uses Real Images to Train Better Diffusion Models

We have all seen them: AI-generated portraits that look almost right, but something is off. Perhaps the skin texture is too plastic, the eyes lack a certain spark, or the anatomy twists in ways human bones simply shouldn’t. Despite the massive leaps in diffusion models like Stable Diffusion, generating truly photorealistic humans remains one of the hardest challenges in computer vision. The core issue often lies in how these models are fine-tuned. Typically, researchers use methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods train the model by showing it two generated images—one “good” and one “bad”—and telling it to prefer the good one. But there is a ceiling to this approach: if the model’s “good” image is still artificial and flawed, the model is only learning to be the “best of a bad bunch.” It isn’t learning what real looks like. ...
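
For the mechanics behind that claim, here is a minimal, hedged sketch of the standard DPO objective that HG-DPO builds on (variable names are illustrative, and the paper's use of real photos as preferred samples is not reproduced here):

```python
# Minimal sketch of the standard DPO objective, assuming per-image
# log-probabilities under the trainable policy and a frozen reference model.
# (Illustrative only; HG-DPO extends this setup with real "winner" images.)
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """All inputs: (B,) log-probs of the preferred ('win') and rejected ('lose') images."""
    # How much more the policy prefers each sample than the reference does.
    win_margin = logp_win - ref_logp_win
    lose_margin = logp_lose - ref_logp_lose
    # Widen the preference gap between the winner and the loser.
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()
```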

2024-05 · 8 min · 1540 words
[Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB 🔗](https://arxiv.org/abs/2411.19474)

Why Blurry LiDAR and RGB Are the Future of Handheld 3D Scanning

In the world of computer vision and robotics, 3D reconstruction is the holy grail. Whether it’s a robot navigating a warehouse, a VR headset mapping your living room, or a Mars rover scanning a dune, the ability to turn the real world into a digital 3D model is critical. For years, the gold standard for handheld scanning (like what you might find on a high-end smartphone) has been a combination of an RGB camera and a sparse LiDAR sensor. This setup works reasonably well in perfect conditions. But the real world isn’t perfect. We encounter dark rooms, white textureless walls, and black objects that absorb light. In these “challenging” scenarios, traditional RGB-based reconstruction fails because it can’t “see” features, and sparse LiDAR fails because it doesn’t capture enough data points to fill in the gaps. ...

2024-11 · 9 min · 1708 words
[BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing 🔗](https://arxiv.org/abs/2504.01786)

Can AI Master Blender? Inside BlenderGym and the Quest for Automated 3D Editing

The world of 3D graphics—the backbone of modern video games, blockbuster movies, and architectural visualization—is notoriously complex. Creating a photorealistic scene isn’t just about artistic vision; it requires technical mastery of sophisticated software like Blender, Maya, or Unreal Engine. An artist doesn’t just “draw” a 3D chair; they manipulate geometry nodes, adjust material shaders, tweak lighting coordinates, and wrangle with physics simulations. Because this process is so time-consuming and specialized, researchers have been racing to automate it using Artificial Intelligence. We’ve seen the rise of Vision-Language Models (VLMs) that can look at an image and understand what’s in it. The dream is simple: tell an AI, “Make the lights dimmer and turn that wooden table into glass,” and have it execute the task instantly. ...

2025-04 · 9 min · 1765 words
[Balanced Rate-Distortion Optimization in Learned Image Compression 🔗](https://arxiv.org/abs/2502.20161)

Balancing Act: How Multi-Objective Optimization Boosts Learned Image Compression

In the world of digital media, we are constantly fighting a tug-of-war. On one side, we want high-quality images that look crisp and true to life (Low Distortion). On the other, we want files that are small enough to stream, store, and share instantly (Low Rate). This trade-off is the heart of image compression. Traditional codecs like JPEG or HEVC solve this with hand-tuned engineering. But recently, Learned Image Compression (LIC)—using deep neural networks to compress images—has started to outperform these traditional methods. LIC models learn from data how to best represent an image. ...
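
Concretely, the baseline objective behind that tug-of-war is the classic single-weight rate-distortion loss (the standard LIC formulation, not the paper's balanced update):

\[ \mathcal{L} = R + \lambda \cdot D \]

where \(R\) is the estimated bitrate, \(D\) the reconstruction distortion, and \(\lambda\) the hand-tuned knob that decides the trade-off; the paper's balanced multi-objective optimization revisits how these two competing terms are weighted during training.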

2025-02 · 9 min · 1754 words
[BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Liu_BWFormer_Building_Wireframe_Reconstruction_from_Airborne_LiDAR_Point_Cloud_with_CVPR_2025_paper.pdf)

From Sparse Clouds to Sharp Edges: Reconstructing 3D Buildings with BWFormer

Imagine trying to draw a precise blueprint of a house, but all you have is a grainy, satellite-like scan taken from a plane flying overhead. Some parts of the roof are missing, trees are blocking the walls, and the data is just a collection of scattered dots. This is the reality of reconstructing 3D building models from airborne LiDAR (Light Detection and Ranging) point clouds. Building reconstruction is a cornerstone technology for smart cities, autonomous driving, and Virtual Reality/Augmented Reality (VR/AR). While we have become good at capturing data, turning that raw, noisy data into clean, lightweight “wireframes” (skeletal representations of geometry) remains a massive challenge. ...

8 min · 1597 words
[BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance 🔗](https://arxiv.org/abs/2502.19694)

Cleaning Up the Streets: How BEVDiffuser Enhances Autonomous Driving Perception without Slowing It Down

Imagine driving down a highway at night in pouring rain. Your eyes strain to distinguish between a parked car on the shoulder and a shadow, or between a distant streetlight and an oncoming vehicle. Now, imagine you are a computer algorithm trying to do the same thing. In autonomous driving, the vehicle’s “brain” typically relies on a Bird’s-Eye-View (BEV) representation. This is a top-down, grid-like map of the surroundings generated from onboard cameras and LiDAR sensors. This map is the foundation for everything the car does next: detecting objects, predicting movement, and planning a path. ...

2025-02 · 8 min · 1588 words
[BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction 🔗](https://arxiv.org/abs/2503.19340)

When Geometry Meets Generative AI: A Deep Dive into BADGR

Imagine you are standing in the middle of a room, holding a camera, and taking a \(360^{\circ}\) panoramic photo. Now, walk into the next room and take another one. Can you reconstruct the entire floor plan of the house—accurate to the centimeter—just from those two photos? This is the problem of Wide-Baseline Floor Plan Reconstruction, and it is notoriously difficult. Unlike a video feed where camera frames are millimeters apart, wide-baseline images are taken far apart (often in different rooms). The visual overlap is small, and traditional computer vision algorithms struggle to stitch these “islands” of data together into a coherent map. ...

2025-03 · 9 min · 1751 words
[Assessing and Learning Alignment of Unimodal Vision and Language Models 🔗](https://arxiv.org/abs/2412.04616)

Breaking Up with CLIP - How to Build Better Vision-Language Models with 94% Less Data

For the last few years, the recipe for building a Vision-Language Model (VLM) has been relatively static. If you wanted a model that understood how images and text relate—like OpenAI’s CLIP—you needed to collect a massive dataset of hundreds of millions of image-text pairs and train two neural networks (one for vision, one for text) from scratch. This process is computationally expensive, data-hungry, and often results in models that are “jacks of all trades, masters of none.” The vision encoder might be decent, and the text encoder might be passable, but neither is state-of-the-art compared to models dedicated to a single modality. ...

2024-12 · 9 min · 1911 words
[ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points 🔗](https://arxiv.org/abs/2503.02745)

From Chaos to Code: Transforming Sparse Point Clouds into Structured 3D Buildings with ArcPro

Imagine flying a drone over a city to map it. The drone captures thousands of images, and through photogrammetry, you generate a 3D representation of the scene. What you get back, however, is rarely a pristine, CAD-ready model. Instead, you get a “point cloud”—a chaotic swarm of millions of floating dots. If the scan is high-quality, the dots are dense, and you can see the surfaces clearly. But in the real world, data is often messy. Aerial scans can be sparse (containing very few points), noisy (points are in the wrong place), or incomplete (entire walls might be missing due to occlusion). ...

2025-03 · 10 min · 2074 words
[AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities 🔗](https://arxiv.org/abs/2412.14123)

AnySat: The Universal Translator for Satellite Imagery

In the world of Computer Vision, things are surprisingly orderly. Whether you are training a model on ImageNet or your own collection of vacation photos, the data usually looks the same: standard RGB images, captured by standard cameras, often resized to a standard resolution (like \(224 \times 224\)). This uniformity has allowed models like ResNet and Vision Transformers (ViTs) to become powerful, general-purpose engines. But if you look at the planet from space, that order collapses into chaos. ...

2024-12 · 10 min · 1989 words
[Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Kumari_Annotation_Ambiguity_Aware_Semi-Supervised_Medical_Image_Segmentation_CVPR_2025_paper.pdf)

Embracing Uncertainty — How AmbiSSL Revolutionizes Medical Image Segmentation

In the world of medical diagnostics, there is rarely a single, indisputable truth. When three different radiologists look at a CT scan of a lung nodule or an MRI of a tumor, they will likely draw three slightly different boundaries around the lesion. This isn’t an error; it is the inherent ambiguity of medical imaging caused by blurred edges, low contrast, and complex anatomy. However, traditional Deep Learning models treat segmentation as a deterministic task. They are trained to output a single “correct” mask. This creates a disconnect between AI outputs and clinical reality. Furthermore, training these models requires massive datasets with pixel-perfect annotations, which are incredibly expensive and time-consuming to obtain. ...

9 min · 1724 words
[All-directional Disparity Estimation for Real-world QPD Images 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_All-directional_Disparity_Estimation_for_Real-world_QPD_Images_CVPR_2025_paper.pdf)

Unlocking Depth in Smartphone Cameras: Deep Learning for Quad Photodiode Sensors

If you have bought a high-end smartphone in the last few years, you have likely benefited from the rapid evolution of image sensors. The quest for instantaneous autofocus has driven hardware engineers to move from standard sensors to Dual-Pixel (DP) sensors, and more recently, to Quad Photodiode (QPD) sensors. While QPD sensors are designed primarily to make autofocus lightning-fast, they hide a second capability: depth estimation. Just as our two eyes allow us to perceive depth through stereo vision, the sub-pixels in a QPD sensor can theoretically function as tiny, multi-view cameras. However, extracting accurate depth (or disparity) from these sensors is notoriously difficult due to physical limitations like uneven lighting and microscopic distances between pixels. ...
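
The geometry behind that claim is ordinary stereo triangulation (a textbook relation, not something specific to this paper): depth \(Z\) follows from the focal length \(f\), baseline \(b\), and measured disparity \(d\) as

\[ Z = \frac{f \cdot b}{d} \]

and since the baseline between QPD sub-pixels is microscopic, the disparity shrinks to a small fraction of a pixel, which is exactly why accurate estimation is so hard.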

8 min · 1603 words
[All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhou_All-Optical_Nonlinear_Diffractive_Deep_Network_for_Ultrafast_Image_Denoising_CVPR_2025_paper.pdf)

Denoising at the Speed of Light—How N3DNet Revolutionizes Optical Computing

In the world of computer vision and signal processing, noise is the enemy. Whether it’s grainy low-light photographs, medical imaging artifacts, or signal degradation in fiber optic cables, “denoising” is a fundamental step in making data usable. Traditionally, we rely on electronic chips (CPUs and GPUs) to clean up these images. We run heavy algorithms—from classical Wiener filtering to modern Convolutional Neural Networks (CNNs)—to guess what the clean image should look like. While effective, this approach hits a hard wall: latency and power consumption. Electronic computing involves moving electrons through transistors, which generates heat and takes time. When you need to process data in real-time, such as in high-speed fiber optic communications, electronic chips often become the bottleneck. ...

8 min · 1561 words
[All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages 🔗](https://arxiv.org/abs/2411.16508)

Beyond English: Why AI Needs to Understand the World's 100 Languages (ALM-bench)

Imagine showing an AI a photo of a bustling street festival. If the festival is Mardi Gras in New Orleans, most top-tier AI models will instantly recognize the beads, the floats, and the context. But what if that photo depicts Mela Chiraghan in Pakistan or a traditional Angampora martial arts display in Sri Lanka? This is where the cracks in modern Artificial Intelligence begin to show. While Large Multimodal Models (LMMs)—systems that can see images and process text simultaneously—have made incredible leaps in capability, they possess a significant blind spot: the majority of the world’s cultures and languages. ...

2024-11 · 7 min · 1336 words
[Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging 🔗](https://arxiv.org/abs/2505.10649)

Why AI Forgets: Solving Catastrophic Forgetting in Medical Imaging

Artificial Intelligence has made massive strides in medical diagnostics, particularly in the analysis of pathology slides. However, there is a hidden problem in the deployment of these systems: they are static. In the fast-moving world of medicine, new diseases are discovered, new subtypes are classified, and scanning equipment is upgraded. Ideally, we want an AI model that learns continuously, adapting to new data without losing its ability to recognize previous conditions. This is the realm of Continual Learning (CL). But when researchers apply standard CL techniques to pathology, they run into a wall known as catastrophic forgetting. The model learns the new task but completely forgets the old one. ...

2025-05 · 9 min · 1718 words
[AdaCM2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction 🔗](https://arxiv.org/abs/2411.12593)

Breaking the Memory Wall: How AdaCM² Enables AI to Watch and Understand Full-Length Movies

Imagine asking an AI to watch a two-hour movie and then asking, “What was the number on the jersey of the man in the background at the very end?” or “How did the protagonist’s relationship with her sister evolve from the first scene to the last?” For most current multimodal AI models, this is an impossible task. While models like GPT-4V or VideoLLaMA are impressive at analyzing short clips (typically 5 to 15 seconds), they hit a hard limit when the video stretches into minutes or hours. This limit is known as the Memory Wall. As a video gets longer, the number of visual “tokens” (pieces of information) the model must hold in its memory grows massively. Eventually, the GPU runs out of memory (OOM), or the model gets overwhelmed by noise and forgets the context. ...
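
A quick back-of-envelope calculation makes the wall concrete (the sampling rate and per-frame token count below are illustrative assumptions, not figures from the paper):

```python
# Rough illustration of the "memory wall": visual tokens grow linearly with
# video length, so a feature film needs ~500x the tokens of a short clip.
TOKENS_PER_FRAME = 256  # typical ViT patch count per frame (assumed)
SAMPLE_FPS = 1          # sample one frame per second (assumed)

def visual_tokens(duration_s: float) -> int:
    return int(duration_s * SAMPLE_FPS * TOKENS_PER_FRAME)

print(visual_tokens(15))        # 15-second clip:     3,840 tokens
print(visual_tokens(2 * 3600))  # two-hour movie: 1,843,200 tokens
```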

2024-11 · 9 min · 1711 words
[Active Hyperspectral Imaging Using an Event Camera 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_Active_Hyperspectral_Imaging_Using_an_Event_Camera_CVPR_2025_paper.pdf)

Breaking the Speed Limit of Color - How Event Cameras Are Revolutionizing Hyperspectral Imaging

Human vision is trichromatic; we perceive the world through a mix of red, green, and blue. However, the physical world is far richer. Every material interacts with light across a continuous spectrum of wavelengths, creating a unique “fingerprint” invisible to the naked eye. Hyperspectral Imaging (HSI) is the technology that allows us to see these fingerprints. By capturing hundreds of spectral bands instead of just three, HSI can distinguish between real and fake plants, detect diseases in tissue, or classify minerals in real-time. ...

9 min · 1796 words
[ARM: Appearance Reconstruction Model for Relightable 3D Generation 🔗](https://arxiv.org/abs/2411.10825)

Beyond Baked Lighting: How ARM Decouples Shape and Material for Relightable 3D

In the rapidly evolving world of Generative AI, creating a 3D object from a single 2D image is something of a “Holy Grail.” We have seen tremendous progress with models that can turn a picture of a cat into a 3D mesh in seconds. However, if you look closely at the results of most current state-of-the-art models, you will notice a flaw: they look great from the original camera angle, but they often fail to react realistically to light. ...

2024-11 · 9 min · 1707 words