If you have bought a high-end smartphone in the last few years, you have likely benefited from the rapid evolution of image sensors. The quest for instantaneous autofocus has driven hardware engineers to move from standard sensors to Dual-Pixel (DP) sensors, and more recently, to Quad Photodiode (QPD) sensors.
While QPD sensors are designed primarily to make autofocus lightning-fast, they also hold a second, less advertised capability: depth estimation. Just as our two eyes allow us to perceive depth through stereo vision, the sub-pixels in a QPD sensor can theoretically function as tiny, multi-view cameras. However, extracting accurate depth (or disparity) from these sensors is notoriously difficult due to physical limitations such as uneven illumination and the microscopic distances between sub-pixels.
In this post, we are diving deep into a CVPR paper titled “All-directional Disparity Estimation for Real-world QPD Images”. We will explore how researchers created the first real-world dataset for this technology and designed a novel deep learning architecture—QuadNet—to turn 2D raw sensor data into precise 3D depth maps.
1. The Hardware Challenge: What is QPD?
To understand the software solution, we first need to understand the hardware problem.
In a traditional camera sensor, one pixel corresponds to one photodiode. In a Dual-Pixel (DP) sensor, a single microlens covers two photodiodes (left and right). This allows the camera to perform “phase detection” autofocus.
Quad Photodiode (QPD) sensors take this a step further. As shown in Figure 1 (B) below, four photodiodes (colored red, green, blue, and yellow in the diagram) share a single on-chip lens.

This configuration provides four “sub-views”: Top-Left, Top-Right, Bottom-Left, and Bottom-Right (a slicing sketch follows the list below). This allows for:
- All-directional autofocus: Detecting phase differences in both horizontal and vertical directions.
- Disparity Estimation: Calculating the shift in pixels between these sub-views to determine depth.
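To make the layout concrete, here is a minimal Python sketch of how the four sub-views could be sliced out of a raw QPD frame. It assumes each on-chip lens maps to a contiguous 2×2 block in the raw array and ignores the Bayer colour pattern; the function names and the left/right and top/bottom binning are illustrative assumptions, not the paper's preprocessing pipeline.

```python
import numpy as np

def split_qpd_subviews(raw: np.ndarray):
    """Split a raw QPD frame into its four sub-views.

    Assumes every on-chip lens covers a 2x2 block of photodiodes stored
    contiguously, so the sub-views fall out of strided slicing. Real sensor
    layouts (and the colour filter array on top) may differ.
    """
    top_left     = raw[0::2, 0::2]
    top_right    = raw[0::2, 1::2]
    bottom_left  = raw[1::2, 0::2]
    bottom_right = raw[1::2, 1::2]
    return top_left, top_right, bottom_left, bottom_right

def to_lr_tb_pairs(raw: np.ndarray):
    """Form the left/right and top/bottom pairs used for horizontal and
    vertical disparity by binning the corresponding photodiodes."""
    tl, tr, bl, br = split_qpd_subviews(raw)
    left, right = (tl + bl) / 2.0, (tr + br) / 2.0
    top, bottom = (tl + tr) / 2.0, (bl + br) / 2.0
    return (left, right), (top, bottom)
```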
The Problem with QPD Disparity
While the concept sounds like standard stereo vision, the reality is much messier.
- Tiny Baseline: In stereo vision (like your eyes), the sensors are centimeters apart. In QPD, the sub-pixels are micrometers apart. This results in a very small disparity (pixel shift), often just a few pixels or even a fraction of a pixel, making it hard to detect.
- Uneven Illumination: Look at Figure 1 (D) above. The intensity plots for the Left (\(I_l\)) and Right (\(I_r\)) sub-images are not identical. Because of the angle of incoming light, one sub-pixel might be significantly brighter than its neighbor. This breaks standard matching algorithms that assume the same point in space will look identical in both views.
- No Ground Truth: Before this paper, there were no large-scale datasets mapping real-world QPD images to accurate depth maps.
2. Building the Foundation: The QPD2K Dataset
Deep learning needs data. Since no dataset existed for QPD disparity, the researchers built one. They introduced QPD2K, comprising 2,100 high-resolution real-world images.
Capturing “ground truth” disparity for a sensor that relies on defocus is tricky. You can’t just estimate it; you need to measure it physically. The team built a custom rig using a QPD sensor paired with a Structured Light System (a DLP projector).

As seen in Figure 2, the process involves projecting patterns (speckles and stripes) onto the scene. By analyzing how these patterns distort, they can calculate an incredibly accurate depth map (\(z\)).
However, the network needs to predict disparity (\(D\)), not absolute depth (\(z\)). For QPD sensors, the relationship between disparity and depth is governed by this equation:

Here, \(z\) is the depth, \(z_f\) is the focus distance, and parameters like \(\alpha\) and \(A\) relate to the physical sensor properties. This equation highlights that disparity in QPD sensors is an affine function of inverse depth.
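As a concrete illustration of that affine relationship, here is a small helper that maps a measured depth map to a disparity target. The single coefficient `a` is an assumed placeholder that lumps together the sensor constants \(\alpha\) and \(A\); the paper's exact parameterization may differ.

```python
import numpy as np

def depth_to_disparity(z: np.ndarray, a: float, z_f: float) -> np.ndarray:
    """Convert metric depth to QPD disparity, assuming D = a * (1/z - 1/z_f).

    `a` lumps together the sensor constants; `z_f` is the focus distance.
    Points on the focus plane get zero disparity; points in front of or
    behind it get shifts of opposite sign.
    """
    return a * (1.0 / z - 1.0 / z_f)
```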
To ensure the ground truth was robust against the “uneven illumination” problem mentioned earlier, the researchers calculated disparity in both horizontal and vertical directions and fused them based on confidence scores:

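A hedged sketch of such a confidence-weighted fusion is shown below; the per-pixel confidence maps and the simple normalisation are assumptions for illustration, not the paper's exact weighting scheme.

```python
import numpy as np

def fuse_disparities(d_h, d_v, c_h, c_v, eps=1e-6):
    """Fuse horizontal and vertical disparity maps with per-pixel confidences.

    d_h, d_v : disparity estimated along the horizontal / vertical baseline
    c_h, c_v : non-negative confidence maps of the same shape
    The weighting below is illustrative; the paper's scheme may differ.
    """
    w_h = c_h / (c_h + c_v + eps)
    return w_h * d_h + (1.0 - w_h) * d_v
```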
This rigorous data collection process provided the “gold standard” needed to train their neural networks.
3. The Core Method: From DPNet to QuadNet
The researchers tackled the problem in two stages. First, they developed DPNet to solve the uneven illumination and small baseline issues for Dual-Pixel data. Then, they extended this into QuadNet to fully utilize the four-directional data of QPD sensors.
Step 1: DPNet and the Illumination-Invariant Module
Standard Convolutional Neural Networks (CNNs) process raw pixel intensities. If the left view is brighter than the right view due to sensor physics, a standard CNN gets confused.
To fix this, DPNet employs an Illumination-Invariant Module. Instead of relying on raw intensity, it relies on edges, which remain consistent regardless of brightness changes. They achieved this using Differential Convolutions (DC).
A specialized version, called the Horizontal/Vertical Sobel Differential Convolution (HSDC/VSDC), calculates the difference between pixels rather than their absolute values. For a \(3 \times 3\) patch, the operation looks like this:

By focusing on the difference between neighboring pixels (\(x_1 - x_3\), etc.), the network extracts features that represent the structure of the scene rather than its lighting.
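The PyTorch sketch below captures the general idea: a fixed Sobel kernel first turns intensities into local differences, and a learnable convolution then mixes those edge responses into features. This is a simplified stand-in for HSDC/VSDC; the class name and structure are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelDifferentialConv(nn.Module):
    """Illustration of an illumination-invariant feature layer.

    A fixed (non-learnable) Sobel kernel computes per-channel differences,
    so the subsequent learnable 3x3 convolution only ever sees edge
    responses, never absolute brightness.
    """

    def __init__(self, in_ch: int, out_ch: int, horizontal: bool = True):
        super().__init__()
        sobel = torch.tensor([[-1., 0., 1.],
                              [-2., 0., 2.],
                              [-1., 0., 1.]])
        if not horizontal:  # vertical-gradient variant (VSDC-like)
            sobel = sobel.t()
        # One fixed depthwise Sobel filter per input channel.
        self.register_buffer("sobel", sobel.view(1, 1, 3, 3).repeat(in_ch, 1, 1, 1))
        self.in_ch = in_ch
        self.mix = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        edges = F.conv2d(x, self.sobel, padding=1, groups=self.in_ch)
        return self.mix(edges)
```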
Coarse-to-Fine Estimation
Because the baseline is so small, the disparity is often sub-pixel (less than 1 pixel). A standard classification approach (is the shift 1, 2, or 3 pixels?) isn’t precise enough.
DPNet uses a Coarse-to-Fine approach.

- Coarse Stage: The network first estimates a rough disparity map (\(D_{init}\)) using a standard cost volume.
- Fine Stage: It creates a sub-pixel cost volume. It takes the features from the right image (\(g\)) and warps them towards the left image based on the initial coarse guess.

It then searches for the residual error (\(\Delta D_{sub}\)) within a tiny range around that guess using a group-wise correlation:

Finally, using a “soft argmin” function (which allows for continuous, decimal-point outputs rather than integer classes), it calculates the precise sub-pixel adjustment:

The final disparity is simply the coarse guess plus this fine-tuned adjustment:

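Putting the fine stage together, the sketch below warps the right-view features by the coarse disparity plus a few candidate residual offsets, scores each candidate, and takes a soft argmin to obtain a continuous \(\Delta D_{sub}\) that is added to \(D_{init}\). It uses a plain dot-product correlation instead of the paper's group-wise correlation, and the offset range, function names, and disparity sign convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def subpixel_refine(feat_l, feat_r, d_init, offsets=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Residual sub-pixel disparity via warping + soft argmin (sketch).

    feat_l, feat_r : (B, C, H, W) feature maps of the left / right sub-views
    d_init         : (B, 1, H, W) coarse disparity (sign convention assumed)
    """
    b, c, h, w = feat_l.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float().to(feat_l.device)  # (H, W, 2)

    costs = []
    for off in offsets:
        shift = (d_init + off).squeeze(1)                    # (B, H, W)
        grid = base.unsqueeze(0).repeat(b, 1, 1, 1)
        grid[..., 0] = grid[..., 0] - shift                  # warp right -> left
        grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0    # normalise for
        grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0    # grid_sample
        warped_r = F.grid_sample(feat_r, grid, align_corners=True)
        costs.append((feat_l * warped_r).mean(dim=1))        # matching score
    scores = torch.stack(costs, dim=1)                       # (B, K, H, W)

    # Soft selection over candidates: a softmax over similarity is equivalent
    # to a soft argmin over matching cost, giving a continuous residual.
    prob = torch.softmax(scores, dim=1)
    cand = torch.tensor(offsets, device=feat_l.device).view(1, -1, 1, 1)
    delta = (prob * cand).sum(dim=1, keepdim=True)           # Delta D_sub
    return d_init + delta                                    # final disparity
```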
Step 2: QuadNet and Edge-Aware Fusion
DPNet handles Left-Right pairs well. But QPD sensors also give us Top-Bottom pairs. Horizontal disparity (\(D_h\)) is great for detecting vertical edges (like a tree trunk), while vertical disparity (\(D_v\)) is better for horizontal edges (like a table edge).
QuadNet runs two instances of DPNet—one for the horizontal pair and one for the vertical pair—and fuses them.

But you can’t just average them. If you are looking at a horizontal window blind, the horizontal disparity might fail completely (the “aperture problem”). You need to trust the vertical disparity more in that specific region.
QuadNet uses an Edge-Aware Fusion Module. It extracts edge maps from the image and uses them as weights. If the network detects a strong horizontal edge, it increases the weight of the vertical disparity estimate, and vice versa.
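A hand-crafted miniature of this idea is sketched below: Sobel gradients supply the edge maps, and their relative strength gates how much each disparity estimate contributes at every pixel. The paper learns this gating inside the network; the fixed weighting here is only an illustration.

```python
import torch
import torch.nn.functional as F

def edge_aware_fuse(img, d_h, d_v, eps=1e-6):
    """Fuse horizontal and vertical disparities with edge-dependent weights.

    Where the image has strong horizontal edges, horizontal matching suffers
    from the aperture problem, so the vertical disparity d_v is trusted more,
    and vice versa. img is a (B, 1, H, W) grayscale tensor.
    """
    sobel_x = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]],
                           device=img.device)
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(img, sobel_x, padding=1).abs()   # responds to vertical edges
    gy = F.conv2d(img, sobel_y, padding=1).abs()   # responds to horizontal edges
    w_h = gx / (gx + gy + eps)   # vertical edges -> trust horizontal disparity
    return w_h * d_h + (1.0 - w_h) * d_v
```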
Census-Based Refinement
Even after fusion, artifacts can remain. To polish the result, the authors use a refinement step based on the Census Transform. The Census Transform is a classic computer vision technique that encodes the local neighborhood structure of a pixel (e.g., “is this pixel brighter than its neighbors?”). It is extremely robust to illumination changes.
The network calculates the error between the left image and the warped right image using the Hamming Distance (HD) of their Census Transforms:

This error map guides a final refinement network (an hourglass structure) to produce the polished output, \(D_{qpd}\):

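For readers unfamiliar with the Census Transform, here is a compact NumPy sketch of a 3×3 census code and the resulting Hamming-distance error map. The window size and the way the error map is fed to the hourglass network are assumptions; the paper's settings may differ.

```python
import numpy as np

def census_3x3(img: np.ndarray) -> np.ndarray:
    """3x3 census transform: one 8-bit code per pixel, one bit per neighbour,
    set when that neighbour is brighter than the centre. Only orderings are
    kept, which is why it shrugs off illumination changes."""
    h, w = img.shape
    pad = np.pad(img, 1, mode="edge")
    code = np.zeros((h, w), dtype=np.uint8)
    bit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            code |= (neighbour > img).astype(np.uint8) << bit
            bit += 1
    return code

def census_hamming_error(left: np.ndarray, warped_right: np.ndarray) -> np.ndarray:
    """Per-pixel Hamming distance between the census codes of the left image
    and the right image warped by the current disparity estimate."""
    diff = census_3x3(left) ^ census_3x3(warped_right)
    return np.unpackbits(diff[..., None], axis=-1).sum(axis=-1)
```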
4. Experimental Results
The researchers compared their method against several State-of-the-Art (SOTA) approaches, including methods designed for Dual-Pixel sensors (like FaceDPNet and SFBD) and general stereo matching methods (like RAFT-Stereo).
Quantitative Performance
The results on the QPD2K dataset were decisive. In the table below, metrics like “bad 0.3” represent the percentage of pixels where the error was greater than 0.3 pixels. Lower numbers are better.

QuadNet achieved the lowest error rates across the board. Notably, it outperformed QPDNet (a competitor) significantly in the strict “bad 0.3” metric (0.229 vs 0.909), showing its superior sub-pixel precision.
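For reference, a “bad x” metric of this kind takes only a few lines of NumPy; the masking and averaging conventions assumed here may differ from the benchmark's evaluation script.

```python
import numpy as np

def bad_pixel_ratio(pred, gt, thresh=0.3, valid=None):
    """Fraction of pixels whose absolute disparity error exceeds `thresh`
    (i.e. the 'bad 0.3' metric when thresh = 0.3). `valid` is an optional
    boolean mask of pixels with usable ground truth."""
    err = np.abs(pred - gt)
    if valid is not None:
        err = err[valid]
    return float((err > thresh).mean())
```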
Qualitative Performance
Visually, the difference is stark. In Figure 5 below, look at the third row (the chair).
- SFBD (Column B) produces a very noisy map.
- QPDNet (Column E) captures the shape but loses definition.
- QuadNet (Column G) produces a clean, sharp depth map that closely resembles the Ground Truth (Column H).

Ablation Studies
To prove that every part of their complex architecture was necessary, the authors performed ablation studies—removing specific modules to see how performance dropped.

As shown in Figure 6, the “Base” model (B) produces only a vague depth map. Adding the Illumination-Invariant Module (C) helps with lighting. Adding sub-pixel refinement (D) sharpens the depth. Combining everything (E) yields the crispest result.
The numerical data backs this up:

The jump from the Base model (bad 0.3 = 0.445) to the full model (bad 0.3 = 0.229) is a massive improvement in accuracy.
Generalization to Dual-Pixel
To show that their DPNet foundation wasn’t just good for QPD, they tested it on existing Dual-Pixel datasets (DP5K and DP-disp).

Even on datasets it wasn’t specifically designed for, DPNet (Column E in Figure 8 below) managed to resolve fine details, such as the gaps between the figurines, which other methods blurred over.

5. Conclusion and Implications
This research marks a significant step forward for mobile photography and computer vision.
- A New Benchmark: The QPD2K dataset resolves a major bottleneck in the field, providing the first high-quality playground for researchers to test QPD algorithms.
- Solving Real-World Physics: The Illumination-Invariant Module elegantly handles the physical reality of QPD sensors (uneven light sensitivity) without needing complex hardware calibration.
- Sub-Pixel Precision: The Coarse-to-Fine architecture proves that deep learning can recover depth information even when the physical disparity is microscopic.
By effectively fusing horizontal and vertical disparity, QuadNet demonstrates that QPD sensors are not just for autofocus—they are capable 3D imaging devices. This technology could lead to smartphones with significantly better portrait modes, improved low-light focus, and perhaps even 3D scanning capabilities, all without adding extra cameras to the back of the phone.