If you have bought a high-end smartphone in the last few years, you have likely benefited from the rapid evolution of image sensors. The quest for instantaneous autofocus has driven hardware engineers to move from standard sensors to Dual-Pixel (DP) sensors, and more recently, to Quad Photodiode (QPD) sensors.

While QPD sensors are designed primarily to make autofocus lightning-fast, they hide a secondary potential: depth estimation. Just as our two eyes allow us to perceive depth through stereo vision, the sub-pixels in a QPD sensor can theoretically function as tiny, multi-view cameras. However, extracting accurate depth (or disparity) from these sensors is notoriously difficult because of physical limitations such as uneven illumination between sub-views and the microscopic baseline separating the sub-pixels.

In this post, we are diving deep into a CVPR paper titled “All-directional Disparity Estimation for Real-world QPD Images”. We will explore how researchers created the first real-world dataset for this technology and designed a novel deep learning architecture—QuadNet—to turn 2D raw sensor data into precise 3D depth maps.

1. The Hardware Challenge: What is QPD?

To understand the software solution, we first need to understand the hardware problem.

In a traditional camera sensor, one pixel corresponds to one photodiode. In a Dual-Pixel (DP) sensor, a single microlens covers two photodiodes (left and right). This allows the camera to perform “phase detection” autofocus.

Quad Photodiode (QPD) sensors take this a step further. As shown in Figure 1 (B) below, four photodiodes (colored red, green, blue, and yellow in the diagram) share a single on-chip lens.

Figure 1. (A) QPD disparity appears only in defocused regions. (B) Light rays passing through the on-chip lens fall on four photodiodes. (C) Horizontal bidisparity sample. (D) The left and right sub-images exhibit uneven brightness.

This configuration provides four “sub-views”: Top-Left, Top-Right, Bottom-Left, and Bottom-Right. This allows for:

  1. All-directional autofocus: Detecting phase differences in both horizontal and vertical directions.
  2. Disparity Estimation: Calculating the shift in pixels between these sub-views to determine depth.
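
To make the four sub-views concrete, here is a minimal NumPy sketch of how they could be separated from a raw QPD frame. It assumes a simplified layout in which each on-chip lens covers a 2x2 block of photodiodes and ignores the color filter array; the function names are illustrative, not taken from the paper.

```python
import numpy as np

def split_qpd_subviews(raw):
    """Split a raw QPD frame into four sub-views (TL, TR, BL, BR).

    Assumes each on-chip lens covers a 2x2 block of photodiodes; real
    sensors interleave this with the color filter array, which is
    ignored here for simplicity.
    """
    tl = raw[0::2, 0::2]
    tr = raw[0::2, 1::2]
    bl = raw[1::2, 0::2]
    br = raw[1::2, 1::2]
    return tl, tr, bl, br

def qpd_to_pairs(tl, tr, bl, br):
    """Form the horizontal (left/right) and vertical (top/bottom) pairs."""
    left, right = (tl + bl) / 2.0, (tr + br) / 2.0
    top, bottom = (tl + tr) / 2.0, (bl + br) / 2.0
    return (left, right), (top, bottom)
```

The horizontal pair behaves like a tiny left/right stereo rig and the vertical pair like a top/bottom one, which is exactly what the disparity estimation described below exploits.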

The Problem with QPD Disparity

While the concept sounds like standard stereo vision, the reality is much messier.

  1. Tiny Baseline: In stereo vision (like your eyes), the two viewpoints are centimeters apart. In QPD, the photodiodes sharing a microlens are micrometers apart. This results in a very small disparity (pixel shift), often only a fraction of a pixel, which makes it hard to detect.
  2. Uneven Illumination: Look at Figure 1 (D) above. The intensity plots for the Left (\(I_l\)) and Right (\(I_r\)) sub-images are not identical. Because of the angle of incoming light, one sub-pixel might be significantly brighter than its neighbor. This breaks standard matching algorithms that assume the same point in space will look identical in both views.
  3. No Ground Truth: Before this paper, there were no large-scale datasets mapping real-world QPD images to accurate depth maps.

2. Building the Foundation: The QPD2K Dataset

Deep learning needs data. Since no dataset existed for QPD disparity, the researchers built one. They introduced QPD2K, comprising 2,100 high-resolution real-world images.

Capturing “ground truth” disparity for a sensor that relies on defocus is tricky. It cannot simply be estimated from the QPD images themselves; it has to be measured independently. The team built a custom rig using a QPD sensor paired with a Structured Light System (a DLP projector).

Figure 2. (A) Hardware setup: stereo QPD sensors and a DLP projector. (B-D) Depth refinement process. (E) Final Disparity map.

As seen in Figure 2, the process involves projecting patterns (speckles and stripes) onto the scene. By analyzing how these patterns distort, they can calculate an incredibly accurate depth map (\(z\)).

However, the network needs to predict disparity (\(D\)), not absolute depth (\(z\)). For QPD sensors, the relationship between disparity and depth is governed by this equation:

Equation relating QPD disparity to depth based on aperture and focal length.

Here, \(z\) is the depth, \(z_f\) is the focus distance, and parameters like \(\alpha\) and \(A\) relate to the physical sensor properties. This equation highlights that disparity in QPD sensors is an affine function of inverse depth.
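
Although the paper's exact constants are sensor-specific, the affine structure can be sketched with the standard defocus-disparity model; the grouping into coefficients \(a\) and \(b\) below is my own illustration, not the paper's notation.

```latex
% Hedged sketch: QPD/DP disparity as an affine function of inverse depth.
% alpha bundles lens and sensor constants, A is the aperture, z_f the focus distance.
D(z) \;=\; \alpha A \left( \frac{1}{z} - \frac{1}{z_f} \right)
     \;=\; \underbrace{\alpha A}_{a} \cdot \frac{1}{z}
           \;\underbrace{-\, \frac{\alpha A}{z_f}}_{b}
\qquad\Longrightarrow\qquad D \;=\; \frac{a}{z} + b
```

A useful sanity check: at the focus distance (\(z = z_f\)) the disparity is zero, which matches Figure 1 (A), where disparity appears only in defocused regions.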

To ensure the ground truth was robust against the “uneven illumination” problem mentioned earlier, the researchers calculated disparity in both horizontal and vertical directions and fused them based on confidence scores:

Equation for confidence-weighted fusion of horizontal and vertical disparity.
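
In pseudocode, this fusion is a per-pixel weighted average. The sketch below is a generic confidence-weighted blend; the confidence definition itself is an assumption, not the paper's exact formula.

```python
import numpy as np

def fuse_disparities(d_h, d_v, c_h, c_v, eps=1e-6):
    """Confidence-weighted fusion of horizontal and vertical disparity maps.

    d_h, d_v : per-pixel disparity estimates from the two directions
    c_h, c_v : per-pixel non-negative confidence maps
    """
    w_h = c_h / (c_h + c_v + eps)       # weight given to the horizontal estimate
    return w_h * d_h + (1.0 - w_h) * d_v
```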

This rigorous data collection process provided the “gold standard” needed to train their neural networks.

3. The Core Method: From DPNet to QuadNet

The researchers tackled the problem in two stages. First, they developed DPNet to solve the uneven illumination and small baseline issues for Dual-Pixel data. Then, they extended this into QuadNet to fully utilize the four-directional data of QPD sensors.

Step 1: DPNet and the Illumination-Invariant Module

Standard Convolutional Neural Networks (CNNs) operate on raw pixel intensities. If the left view is systematically brighter than the right view because of the sensor's optics, a standard CNN struggles to match corresponding points between the two views.

To fix this, DPNet employs an Illumination-Invariant Module. Instead of relying on raw intensity, it relies on edges, which remain consistent regardless of brightness changes. They achieved this using Differential Convolutions (DC).

A specialized version, called the Horizontal/Vertical Sobel Differential Convolution (HSDC/VSDC), calculates the difference between pixels rather than their absolute values. For a \(3 \times 3\) patch, the operation looks like this:

Equation showing the calculation for differential convolution.

By focusing on the difference between neighboring pixels (\(x_1 - x_3\), etc.), the network extracts features that represent the structure of the scene rather than its lighting.
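
As a rough illustration of the idea (not the paper's exact HSDC/VSDC layer), the sketch below first turns features into horizontal Sobel differences and then applies a learnable convolution, so the output depends on local contrast rather than absolute brightness.

```python
import torch
import torch.nn.functional as F

def horizontal_sobel_diff_conv(x, weight):
    """Sobel-style differential convolution (illustrative sketch).

    x      : (N, C, H, W) input feature map
    weight : (C_out, C, 3, 3) learnable kernel
    """
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]], device=x.device)
    sobel_x = sobel_x.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
    # Depthwise Sobel: per-channel horizontal differences (edge responses).
    grad_x = F.conv2d(x, sobel_x, padding=1, groups=x.shape[1])
    # Learnable convolution applied to differences instead of raw intensities.
    return F.conv2d(grad_x, weight, padding=1)
```

A vertical variant would simply swap in the transposed Sobel kernel.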

Coarse-to-Fine Estimation

Because the baseline is so small, the disparity is often sub-pixel (less than 1 pixel). A standard classification approach (is the shift 1, 2, or 3 pixels?) isn’t precise enough.

DPNet uses a Coarse-to-Fine approach.

Figure 3. The architecture of DPNet, including the illumination-invariant module and the coarse-to-fine module.

  1. Coarse Stage: The network first estimates a rough disparity map (\(D_{init}\)) using a standard cost volume.
  2. Fine Stage: It creates a sub-pixel cost volume. It takes the features from the right image (\(g\)) and warps them towards the left image based on the initial coarse guess.

Equation showing the warping of feature g by the initial disparity.
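
In practice this warp is a bilinear resampling of the right-view features along the horizontal axis. The sketch below uses grid_sample; the sign convention and padding are assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def warp_horizontal(feat, disp):
    """Warp right-view features toward the left view by a disparity map.

    feat : (N, C, H, W) right-view feature map g
    disp : (N, 1, H, W) coarse horizontal disparity D_init, in pixels
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device),
                            indexing="ij")
    xs = xs.float().unsqueeze(0) - disp[:, 0]          # shift columns by disparity
    ys = ys.float().unsqueeze(0).expand_as(xs)
    grid = torch.stack((2 * xs / (w - 1) - 1,          # normalize to [-1, 1]
                        2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)
```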

It then searches for the residual error (\(\Delta D_{sub}\)) within a tiny range around that guess using a group-wise correlation:

Equation for sub-pixel correlation calculation.
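
Group-wise correlation splits the feature channels into groups and turns each group into one correlation channel. The sketch below follows the common GwcNet-style definition for a single residual candidate; in the network this would be repeated for every candidate offset and stacked into the sub-pixel cost volume.

```python
import torch

def groupwise_correlation(f_left, f_right_warped, num_groups):
    """Group-wise correlation between left and warped right features.

    f_left, f_right_warped : (N, C, H, W) features, C divisible by num_groups
    Returns a (N, num_groups, H, W) correlation map.
    """
    n, c, h, w = f_left.shape
    g = num_groups
    fl = f_left.view(n, g, c // g, h, w)
    fr = f_right_warped.view(n, g, c // g, h, w)
    return (fl * fr).mean(dim=2)      # average the products within each group
```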

Finally, using a “soft argmin” function (which allows for continuous, decimal-point outputs rather than integer classes), it calculates the precise sub-pixel adjustment:

Equation for calculating the delta disparity using softmax.

The final disparity is simply the coarse guess plus this fine-tuned adjustment:

Equation showing final disparity as the sum of initial and delta disparity.
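
Putting the last two steps together, a generic soft-argmin (in the spirit of GC-Net-style disparity regression) converts the sub-pixel cost volume into a continuous residual, which is then added to the coarse estimate; the exact normalization used in the paper may differ.

```python
import torch

def soft_argmin(cost, candidates):
    """Differentiable, sub-pixel 'argmin' over residual disparity candidates.

    cost       : (N, D, H, W) matching cost for each candidate
    candidates : (D,) candidate residuals, e.g. torch.linspace(-1.0, 1.0, D)
    """
    prob = torch.softmax(-cost, dim=1)                        # lower cost -> higher weight
    return (prob * candidates.view(1, -1, 1, 1)).sum(dim=1)   # (N, H, W)

# Final disparity: coarse estimate plus the regressed sub-pixel residual.
# d_final = d_init + soft_argmin(sub_pixel_cost, candidates)
```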

Step 2: QuadNet and Edge-Aware Fusion

DPNet handles Left-Right pairs well. But QPD sensors also give us Top-Bottom pairs. Horizontal disparity (\(D_h\)) is great for detecting vertical edges (like a tree trunk), while vertical disparity (\(D_v\)) is better for horizontal edges (like a table edge).

QuadNet runs two instances of DPNet—one for the horizontal pair and one for the vertical pair—and fuses them.

Figure 4. Overview of QuadNet, which uses edge features for fusion.

But you can’t just average them. If you are looking at a horizontal window blind, the horizontal disparity might fail completely (the “aperture problem”). You need to trust the vertical disparity more in that specific region.

QuadNet uses an Edge-Aware Fusion Module. It extracts edge maps from the image and uses them as weights. If the network detects a strong horizontal edge, it increases the weight of the vertical disparity estimate, and vice versa.
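
The intuition can be sketched with plain image gradients: a strong horizontal gradient (a vertical edge) makes the horizontal disparity trustworthy, and vice versa. The actual module learns its fusion weights from edge features, so the hand-crafted weighting below is only an analogy.

```python
import numpy as np

def edge_aware_fusion(img, d_h, d_v, eps=1e-6):
    """Blend horizontal/vertical disparity maps based on local edge orientation."""
    gy, gx = np.gradient(img.astype(np.float64))   # gradients along rows (y) and columns (x)
    w_h = (np.abs(gx) + eps) / (np.abs(gx) + np.abs(gy) + 2 * eps)
    return w_h * d_h + (1.0 - w_h) * d_v           # vertical edges favor d_h
```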

Census-Based Refinement

Even after fusion, artifacts can remain. To polish the result, the authors use a refinement step based on the Census Transform. The Census Transform is a classic computer vision technique that encodes the local neighborhood structure of a pixel (e.g., “is this pixel brighter than its neighbors?”). It is extremely robust to illumination changes.

The network calculates the error between the left image and the warped right image using the Hamming Distance (HD) of their Census Transforms:

Equation showing Census Transform error calculation.
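
For readers unfamiliar with it, here is a minimal NumPy sketch of a 3x3 Census Transform and the resulting per-pixel Hamming-distance error map; the window size and border handling (np.roll wraps around) are simplifications, not the paper's exact settings.

```python
import numpy as np

def census_transform(img, radius=1):
    """Encode each pixel's 3x3 neighborhood as bits: 1 if the neighbor < center."""
    h, w = img.shape
    bits = np.zeros((h, w), dtype=np.uint16)
    bit = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits |= (shifted < img).astype(np.uint16) << bit
            bit += 1
    return bits

def census_error(left, right_warped):
    """Per-pixel Hamming distance between the Census codes of the two images."""
    diff = census_transform(left) ^ census_transform(right_warped)
    # Popcount of the XOR = number of differing neighborhood bits.
    as_bytes = diff.view(np.uint8).reshape(*diff.shape, 2)
    return np.unpackbits(as_bytes, axis=-1).sum(axis=-1)
```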

This error map guides a final refinement network (an hourglass structure) to produce the polished output, \(D_{qpd}\):

Equation showing the final QuadNet disparity calculation.

4. Experimental Results

The researchers compared their method against several state-of-the-art (SOTA) approaches, including methods designed for Dual-Pixel sensors (such as FaceDPNet and SFBD) and general stereo matching methods (such as RAFT-Stereo).

Quantitative Performance

The results on the QPD2K dataset were decisive. In the table below, metrics like “bad 0.3” represent the percentage of pixels where the error was greater than 0.3 pixels. Lower numbers are better.

Table 1. Quantitative results on QPD2K dataset comparing DPNet and QuadNet to SOTA.

QuadNet achieved the lowest error rates across the board. Notably, it outperformed QPDNet (a competitor) significantly in the strict “bad 0.3” metric (0.229 vs 0.909), showing its superior sub-pixel precision.
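
For reference, a "bad δ" metric is just a thresholded error rate. The few lines below follow the definition given above (percentage of pixels whose absolute error exceeds the threshold); the paper's exact masking of invalid pixels is an assumption here.

```python
import numpy as np

def bad_delta(pred, gt, delta=0.3):
    """Percentage of pixels whose absolute disparity error exceeds `delta`."""
    valid = np.isfinite(gt)               # assumed validity mask
    err = np.abs(pred - gt)[valid]
    return 100.0 * (err > delta).mean()
```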

Qualitative Performance

Visually, the difference is stark. In Figure 5 below, look at the third row (the chair).

  • SFBD (Column B) produces a very noisy map.
  • QPDNet (Column E) captures the shape but loses definition.
  • QuadNet (Column G) produces a clean, sharp depth map that closely resembles the Ground Truth (Column H).

Figure 5. Qualitative experimental results on QPD2K. QuadNet shows the best overall performance compared to other methods.

Ablation Studies

To prove that every part of their complex architecture was necessary, the authors performed ablation studies—removing specific modules to see how performance dropped.

Figure 6. Ablation results on QPD2K dataset. Visual comparison of removing different modules.

As shown in Figure 6, the “Base” model (B) produces a vague, blurry result. Adding the Illumination-Invariant Module (C) helps with lighting. Adding sub-pixel refinement (D) sharpens the depth. Combining everything (E) yields the crispest result.

The numerical data backs this up:

Table 2. Ablation study showing incremental improvements with each module.

The jump from the Base model (bad 0.3 = 0.445) to the full model (bad 0.3 = 0.229) is a massive improvement in accuracy.

Generalization to Dual-Pixel

To show that their DPNet foundation wasn’t just good for QPD, they tested it on existing Dual-Pixel datasets (DP5K and DP-disp).

Figure 7. Qualitative experimental results on DP5K. DPNet performs better in blurred regions.

Even on datasets it wasn’t specifically designed for, DPNet (Column E in Figure 8 below) managed to resolve fine details, such as the gaps between the figurines, which other methods blurred over.

Figure 8. Qualitative experimental results on DP-disp. DPNet outperforms SOTA in detailed regions.

5. Conclusion and Implications

This research marks a significant step forward for mobile photography and computer vision.

  1. A New Benchmark: The QPD2K dataset resolves a major bottleneck in the field, providing the first high-quality playground for researchers to test QPD algorithms.
  2. Solving Real-World Physics: The Illumination-Invariant Module elegantly handles the physical reality of QPD sensors (uneven light sensitivity) without needing complex hardware calibration.
  3. Sub-Pixel Precision: The Coarse-to-Fine architecture proves that deep learning can recover depth information even when the physical disparity is microscopic.

By effectively fusing horizontal and vertical disparity, QuadNet demonstrates that QPD sensors are not just for autofocus—they are capable 3D imaging devices. This technology could lead to smartphones with significantly better portrait modes, improved low-light focus, and perhaps even 3D scanning capabilities, all without adding extra cameras to the back of the phone.