Papers

[TinyFusion: Diffusion Transformers Learned Shallow 🔗](https://arxiv.org/abs/2412.01199)

TinyFusion: How to Shrink Diffusion Transformers Without Losing Their Magic

If you have been following the generative AI space recently, you know that Diffusion Transformers (DiTs) are the current heavyweights. From OpenAI’s Sora to Stable Diffusion 3, replacing the traditional U-Net backbone with a Transformer architecture has unlocked incredible capabilities in image and video generation. But there is a catch: these models are massive. They come with excessive parameter counts that make them slow and expensive to run in real-world applications. If you want to deploy a high-quality image generator on a mobile device or a standard consumer GPU, you are often out of luck. ...

[Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model 🔗](https://arxiv.org/abs/2411.19108)

TeaCache: Accelerating Video Diffusion by Watching the Inputs

We are living in the golden age of generative video. From Sora to Open-Sora and Latte, Diffusion Transformers (DiTs) have unlocked the ability to generate high-fidelity, coherent videos from simple text prompts. However, there is a massive bottleneck keeping these tools from real-time applications: inference speed. Generating a single second of video can take surprisingly long on consumer hardware. This is primarily due to the sequential nature of diffusion models. To create an image or video frame, the model must iteratively remove noise over dozens or hundreds of timesteps. It is a slow, methodical process where every step depends on the previous one. ...

[The Scene Language: Representing Scenes with Programs, Words, and Embeddings 🔗](https://arxiv.org/abs/2410.16770)

Bridging Code and Art: How 'Scene Language' Revolutionizes 3D Generation

How do you describe a scene? It sounds like a simple question, but try to be precise. Imagine you have just returned from a trip to Easter Island and you want to describe the famous Ahu Akivi site to a friend. You might say, “There are seven moai statues in a row, facing the same direction.” Your friend asks, “What is a moai?” You reply, “It’s a monolithic human figure carved from stone, with a large head and no legs.” “Do they look exactly the same?” “No,” you hesitate. “They share the same structure, but each has a slightly different weathered texture and distinct identity.” ...

[Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding 🔗](https://arxiv.org/abs/2502.10392)

Can We Make 3D Visual Grounding Real-Time? Enter TSP3D

Imagine you are asking a home assistance robot to “pick up the red mug on the table to the left.” For a human, this is trivial. For a machine, this is a complex multi-modal puzzle known as 3D Visual Grounding (3DVG). The robot must parse the natural language command, perceive the 3D geometry of the room (usually via point clouds), understand the semantic relationships between objects (table, mug, left, right), and pinpoint the exact bounding box of the target. ...

[Task-driven Image Fusion with Learnable Fusion Loss 🔗](https://arxiv.org/abs/2412.03240)

Teaching Machines to See - How TDFusion Uses Meta-Learning for Task-Driven Image Fusion

In the world of computer vision, more data usually leads to better decisions. This is particularly true when dealing with multi-modal sensors. Consider an autonomous vehicle driving at night: a visible light camera captures the rich textures of the road but might miss a pedestrian in the shadows. Conversely, an infrared sensor picks up the pedestrian’s thermal signature clearly but loses the texture of the lane markings. The solution to this has long been Image Fusion: mathematically combining these two inputs into a single, comprehensive image. Traditionally, the goal of image fusion was to create an image that looks “good” to a human observer: balanced brightness, clear details, and high contrast. ...

[TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2503.17032)

TaoAvatar: Bringing Photorealistic, Real-Time Avatars to AR with 3D Gaussian Splatting

Imagine putting on an Augmented Reality (AR) headset like the Apple Vision Pro and having a conversation with a holographic projection of a friend or a virtual assistant. For the experience to be immersive, this avatar needs to look photorealistic, move naturally, and, crucially, respond in real time. While we have seen incredible advances in digital human rendering, a significant gap remains between high-fidelity graphics and real-time performance on mobile hardware. Current industry standards often require massive computational power or rely on artist-created rigs that don’t scale well. On the academic side, neural methods like NeRF (Neural Radiance Fields) offer realism but are often too slow for mobile devices. ...

[Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs 🔗](https://arxiv.org/abs/2503.05082)

Taming the Hallucinations - How Video Diffusion Improves Sparse 3D Gaussian Splatting

Imagine you are trying to reconstruct a detailed 3D model of a room, but you only have six photographs taken from the center. This is the challenge of sparse-input 3D reconstruction. While recent technologies like 3D Gaussian Splatting (3DGS) have revolutionized how we render scenes, they typically demand a dense cloud of images to work their magic. When you feed them only a handful of views, the results are often riddled with “black holes,” floating artifacts, and blurred geometry. ...

[TKG-DM: Training-free Chroma Key Content Generation Diffusion Model 🔗](https://arxiv.org/abs/2411.15580)

Mastering the Green Screen: How TKG-DM Hacks Diffusion Noise for Perfect Chroma Keying

If you have ever played around with text-to-image models like Stable Diffusion or Midjourney, you know they are incredible at generating complex scenes. However, they often fail at a task that is trivial for traditional CGI but essential for graphic design and game development: generating a foreground object on a clean, removable background. Try prompting a model for “a cat on a solid green background.” You will likely get a cat, but the fur might be tinted green, the shadows might look unnatural, or the background might have weird textures. This is known as “color bleeding,” and it makes extracting the subject—a process known as chroma keying—a nightmare. ...

[TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction 🔗](https://arxiv.org/abs/2411.16788)

Beyond Augmentation: How TIDE Uses Local Concepts to Fix AI Generalization

Deep learning models are notoriously brittle. You train a model on high-quality photographs of dogs, and it achieves 99% accuracy. But show that same model a simple line sketch of a dog, or a photo of a dog in an unusual environment, and it falls apart. This problem is known as Domain Generalization (DG). Specifically, we often face the challenge of Single-Source Domain Generalization (SSDG), where we only have data from one domain (e.g., photos) but need the model to work everywhere (sketches, paintings, cartoons). ...

[TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Liu_TFCustom_Customized_Image_Generation_with_Time-Aware_Frequency_Feature_Guidance_CVPR_2025_paper.pdf)

Mastering Personalization in AI Art: How TFCustom Uses Time and Frequency to Perfect Detail

Imagine you have a photo of your specific hiking backpack, the one with the unique patches and a distinct texture. You want to generate an image of that exact backpack sitting on a bench in a futuristic city. You type the prompt into a standard text-to-image model, but the result is disappointing. It generates a backpack, sure, but it’s generic. It’s missing the patches. The texture looks like smooth plastic instead of canvas. ...

[T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting 🔗](https://arxiv.org/abs/2502.20625)

Can AI Count What It Hasn't Seen? Deep Dive into T2ICount and Zero-Shot Counting with Diffusion Models

Imagine you are looking at a photo of a picnic. There are three wicker baskets, fifty red apples, and two small teddy bears. If someone asks you to “count the bears,” you instantly focus on the two toys and ignore the sea of apples. This ability to filter visual information based on language is intuitive for humans. However, for Artificial Intelligence, this is a surprisingly difficult task. In the world of computer vision, this task is known as Zero-Shot Object Counting. The goal is to build a model that can count instances of any object category specified by a text description, without ever having been explicitly trained on that specific category. ...

[Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation 🔗](https://arxiv.org/abs/2411.17763)

Reflect3D: How Finding Symmetry in 2D Images Revolutionizes 3D AI

As Blaise Pascal put it, “Symmetry is what we see at a glance.” When you look at a photograph of a car, a chair, or a butterfly, your brain instantly infers its structure. You don’t need to see the other side to know it’s there; you intuitively understand that the object is symmetric. This perception is fundamental to how humans interpret the 3D world. However, for computer vision systems, detecting 3D symmetry from a single, flat 2D image is an immensely difficult task. ...

[Supervising Sound Localization by In-the-wild Egomotion 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Min_Supervising_Sound_Localization_by_In-the-wild_Egomotion_CVPR_2025_paper.pdf)

How Walking Around Helps AI Hear Better — Learning Sound Localization from Camera Motion

Imagine you are walking down a busy city street with your eyes closed. You hear a siren. To figure out where it’s coming from, you might instinctively turn your head or walk forward. As you move, the sound changes: if you rotate right and the sound stays to your left, you know exactly where it is relative to you. This dynamic relationship between movement (egomotion) and sound perception is fundamental to how humans navigate the world. ...

[StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer 🔗](https://arxiv.org/abs/2501.11319)

Fixing Style Transfer: How the Starting Point Determines the Destination

In the rapidly evolving world of Generative AI, Style Transfer remains one of the most fascinating applications. The goal is simple yet challenging: take the artistic appearance (style) of one image and apply it to the structure (content) of another. Imagine painting a photograph of your house using the brushstrokes of Van Gogh’s Starry Night. With the advent of Diffusion Models (like Stable Diffusion), the quality of generated images has skyrocketed. However, adapting these massive models for specific style transfer tasks typically requires expensive training or fine-tuning (like LoRA or DreamBooth). This led to the rise of training-free methods, which try to leverage the pre-trained knowledge of the model without modifying its weights. ...

[Style-Editor: Text-driven object-centric style editing 🔗](https://arxiv.org/abs/2408.08461)

Beyond Filters: How Style-Editor Uses Text to Edit Objects Without Masks

Imagine you are a graphic designer working on an advertisement. You have a perfect photo of a car on a mountain road, but the client wants the car to look “golden” instead of red. Traditionally, this means opening Photoshop, carefully tracing a mask around the car to separate it from the background, and then applying color grading layers. Now, imagine you could just type “Golden car” and an AI would handle the rest, changing the car’s texture while leaving the mountain road completely untouched. ...

[Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection 🔗](https://arxiv.org/abs/2503.09968)

Can AI Dream of Rainy Nights? Teaching Object Detectors to Evolve Styles via Chain-of-Thought

Imagine you are training a self-driving car. You live in a sunny coastal city, so you gather thousands of hours of driving footage—all under bright blue skies and clear visibility. You train your object detection model until it detects pedestrians and other cars perfectly. Now, you ship that car to London on a foggy, rainy night. Suddenly, the model fails. The pedestrians are obscured by mist; the cars are just blurs of red taillights reflecting on wet pavement. ...

[Structured 3D Latents for Scalable and Versatile 3D Generation 🔗](https://arxiv.org/abs/2412.01506)

Unifying 3D Generation: Inside TRELLIS and the Structured Latent Space

In the world of AI, 2D image generation has had its “iPhone moment.” Tools like Midjourney and DALL-E have made generating photorealistic images from text as easy as typing a sentence. However, the third dimension, 3D generation, has remained a tougher nut to crack. Why is 3D so much harder? One major reason is the “format war.” In 2D, a pixel is a pixel. But in 3D, we have a chaotic mix of representations: meshes (vertices and faces), point clouds, Neural Radiance Fields (NeRFs), and the recently popularized 3D Gaussian Splatting. Each format has its own strengths and weaknesses. Meshes are great for geometry but hard to texture realistically via AI. NeRFs and Gaussians look amazing but are often unstructured “clouds” that are difficult to edit or import into a game engine. ...

[Structure-from-Motion with a Non-Parametric Camera Model 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_Structure-from-Motion_with_a_Non-Parametric_Camera_Model_CVPR_2025_paper.pdf)

Beyond Pinhole Models—A New Era for Structure-from-Motion

Imagine trying to build a 3D map of a room using a collection of photos. This process, known as Structure-from-Motion (SfM), is the backbone of modern photogrammetry and 3D reconstruction. When you use standard photos from a smartphone or a DSLR, current algorithms like COLMAP work wonders. But what happens if you use a fisheye lens, a GoPro with wide FOV, or a complex catadioptric 360-degree camera? Suddenly, the standard pipelines break. ...

[Structure from Collision 🔗](https://arxiv.org/abs/2505.21335)

Cracking the Shell - How Collisions Reveal Invisible Internal Structures in NeRFs

Imagine looking at a pristine, opaque billiard ball sitting on a table. Now, imagine a ping-pong ball painted to look exactly like that billiard ball sitting next to it. To a camera—and to standard computer vision algorithms—these two objects are identical. They share the same geometry and the same surface texture. However, if you were to drop both balls, their true nature would instantly reveal itself. The solid billiard ball would land with a heavy thud, barely deforming. The hollow ping-pong ball would bounce, vibrate, and deform upon impact. The motion betrays the structure. ...

[SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving 🔗](https://arxiv.org/abs/2411.15482)

SplatFlow: Mastering Dynamic Scene Reconstruction Without Bounding Boxes

The race toward fully autonomous driving relies heavily on one critical resource: data. While real-world driving logs are invaluable, they are finite and often fail to capture the “long tail” of rare, dangerous edge cases. This is where simulation steps in. If we can create photorealistic, physics-compliant digital twins of the real world, we can train and test autonomous vehicles (AVs) in infinite variations of complex scenarios. However, reconstructing a dynamic urban environment from sensor data is notoriously difficult. Modern techniques like Neural Radiance Fields (NeRFs) and the more recent 3D Gaussian Splatting (3DGS) have revolutionized static scene reconstruction. They can render buildings and parked cars with breathtaking fidelity. But put a moving truck in the frame, and things fall apart. The moving object often appears as a ghostly, blurred trail, or artifacts corrupt the static background. ...