In the world of computer vision, object detection is a foundational task with applications ranging from autonomous driving to medical imaging. A persistent challenge in this field has been the trade-off between speed and accuracy. Highly precise models are often too slow for real-time scenarios, while faster models sometimes lack the accuracy needed for mission-critical applications.

But what if you could have both? What if you could train a state-of-the-art detector not in a massive, expensive cloud infrastructure, but on a single consumer-grade GPU sitting in your lab or home office?

This is the promise of YOLOv4: Optimal Speed and Accuracy of Object Detection. Published in 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, this work isn’t built on a single groundbreaking idea. Instead, it’s a masterclass in engineering and empirical science, where the authors meticulously evaluated and combined dozens of advances in deep learning to create a detector that is both highly accurate and blazing fast—democratizing high-performance object detection.

As shown in the performance chart from the paper:

Figure 1 shows a performance plot comparing YOLOv4 to other object detectors like EfficientDet and YOLOv3. YOLOv4 achieves a high AP score (around 43.5%) at a very high FPS (around 65 on a V100), demonstrating its superior balance of speed and accuracy.

YOLOv4 carves out a leading position, outperforming its predecessor YOLOv3 and matching or surpassing competitive models like EfficientDet, while maintaining real-time speeds.

In this post, we’ll explore the architecture, training strategies, and experiments that make YOLOv4 a landmark in object detection.


Background: The Anatomy of an Object Detector

Before dissecting YOLOv4, let’s establish a common understanding. Modern object detectors, whether one-stage or two-stage, share a similar structure:

Figure 2 illustrates the general architecture of an object detector, composed of an Input, a Backbone for feature extraction, a Neck for feature aggregation, and a Head for making predictions.

  1. Input: The image fed into the network. It can be a single image, patches for more localized features, or multi-scale images via an image pyramid.

  2. Backbone: A deep CNN, often pre-trained on ImageNet, that extracts rich feature maps from the input. Examples: ResNet, VGG, DenseNet, or Darknet in YOLO.

  3. Neck: Aggregates feature maps from different backbone layers, producing robust features for detecting objects at multiple scales. Popular methods include Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN).

  4. Head: The prediction layer. It outputs classes and bounding boxes from the aggregated features.

    • One-Stage Detectors (Dense Prediction): Predict objects directly from feature maps in a single step (e.g., YOLO, SSD)—fast but sometimes less accurate.
    • Two-Stage Detectors (Sparse Prediction): First propose candidate regions, then classify/detect within them (e.g., Faster R-CNN)—more accurate but slower.

YOLOv4 is a one-stage detector—key to its real-time performance. The main goal of the authors was to optimize its architecture for parallel computation on standard GPUs, ensuring practicality and accessibility.
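
To make this four-part decomposition concrete, here is a minimal, hypothetical PyTorch-style skeleton of a one-stage detector. The module boundaries mirror the figure above; the concrete layers are placeholders, not YOLOv4's actual implementation.

```python
import torch
import torch.nn as nn

class OneStageDetector(nn.Module):
    """Illustrative skeleton only: Input -> Backbone -> Neck -> Head."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # e.g., CSPDarknet53: extracts multi-scale feature maps
        self.neck = neck          # e.g., SPP + PAN: aggregates features across scales
        self.head = head          # e.g., YOLOv3-style head: dense box + class predictions

    def forward(self, images: torch.Tensor):
        features = self.backbone(images)  # typically feature maps at strides 8, 16, 32
        fused = self.neck(features)       # fused multi-scale features
        return self.head(fused)           # raw predictions; decoding and NMS happen afterwards
```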


The Core Philosophy: Bag of Freebies & Bag of Specials

One of YOLOv4’s ingenious approaches is to categorize potential improvements:

  • Bag of Freebies (BoF): Techniques that improve accuracy without increasing inference cost. Typically applied during training. Examples: data augmentation, better loss functions.

  • Bag of Specials (BoS): Modules or post-processing steps that add minimal inference cost but boost accuracy. Examples: attention mechanisms, receptive field enhancements.

The paper methodically explores an array of BoF and BoS techniques, identifying the most effective mix for a superior detector.


Crafting the YOLOv4 Architecture

With this philosophy in mind, the authors carefully chose components for each part of YOLOv4.

Backbone: CSPDarknet53

A strong backbone for detection needs:

  • Large receptive field to capture object context.
  • High model capacity to represent complex features.

Comparisons led to CSPDarknet53 as the optimal choice:

Table 1 compares three backbone models. CSPDarknet53 has a larger receptive field (725x725) and more parameters (27.6M) than CSPResNeXt50, making it a more powerful feature extractor for object detection despite CSPResNeXt50 being slightly better for pure classification.

While CSPResNeXt50 is slightly superior for image classification, CSPDarknet53—with 29 convolutional layers and a larger receptive field—proved better for object detection. CSP (Cross-Stage-Partial connections) improves gradient flow and efficiency.
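
The CSP idea itself is straightforward to sketch: the incoming feature map is split along the channel dimension, only one part is sent through the stage's convolutional blocks, and the two parts are concatenated again at the stage boundary. The snippet below is a rough sketch under those assumptions; the block counts, layer sizes, and merge convolution are illustrative rather than CSPDarknet53's real configuration.

```python
import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """Cross-Stage-Partial stage sketch: split channels, transform one branch, re-merge."""
    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        self.part1 = nn.Conv2d(channels, half, kernel_size=1)  # shortcut branch (kept "partial")
        self.part2 = nn.Conv2d(channels, half, kernel_size=1)  # branch fed through the blocks
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1),
                nn.BatchNorm2d(half),
                nn.Mish(),  # YOLOv4 uses Mish activations in the backbone
            )
            for _ in range(num_blocks)
        ])
        self.merge = nn.Conv2d(2 * half, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = self.part1(x)           # cheap path: preserves gradient flow across the stage
        deep = self.blocks(self.part2(x))  # heavy path: most of the computation
        return self.merge(torch.cat([shortcut, deep], dim=1))
```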

Neck: SPP + PAN

Two modules enhance backbone features:

  1. SPP (Spatial Pyramid Pooling): Max-pools the same feature map at several kernel sizes and concatenates the results, greatly increasing the receptive field at virtually no speed cost (see the sketch after this list).
  2. PAN (Path Aggregation Network): Enhances both high-level and low-level feature blending. YOLOv4 modifies PAN to further improve multi-scale fusion.
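
The SPP variant used here is easy to picture: the feature map is max-pooled with stride 1 at kernel sizes 5, 9, and 13, and the pooled maps are concatenated with the original, so spatial resolution is preserved while the receptive field grows. A minimal sketch, with channel bookkeeping simplified:

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """SPP as used in YOLO-style necks: parallel stride-1 max pools, then concatenation."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride=1 with symmetric padding keeps the spatial size unchanged
        self.pools = nn.ModuleList([
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # concatenate the original map with its pooled versions along the channel axis
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Usage: a 512-channel map comes out with 4 * 512 = 2048 channels, same H x W.
features = torch.randn(1, 512, 19, 19)
print(SPPBlock()(features).shape)  # torch.Size([1, 2048, 19, 19])
```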

Head: YOLOv3-style Anchor-based Prediction

For the head, YOLOv4 adopts the proven YOLOv3 anchor-based prediction architecture, maintaining its strengths in fast, reliable bounding-box estimation.


The Winning Combination: YOLOv4’s BoF & BoS

Bag of Freebies (BoF)

Key techniques:

  • Mosaic Data Augmentation: New in YOLOv4—combines four images into one, placing objects in varied contexts.

Figure 3 shows examples of Mosaic data augmentation, where four different images are combined into a single training image, creating a collage-like effect.

This increases object diversity per batch, reduces dependency on large batch sizes, and boosts robustness.
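
A simplified illustration of the idea follows, assuming a fixed 2x2 layout. The actual augmentation picks a random split point and also remaps each image's bounding boxes onto the new canvas; both details are omitted here.

```python
import cv2
import numpy as np

def mosaic_4(images, out_size=608):
    """Simplified Mosaic sketch: tile four images into one 2x2 collage."""
    assert len(images) == 4
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # top-left corner of each quadrant
    for img, (y, x) in zip(images, corners):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas

# Usage with four dummy images:
imgs = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
collage = mosaic_4(imgs)  # shape (608, 608, 3)
```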

  • Self-Adversarial Training (SAT): Two-stage approach where the network first “attacks” the image to hide objects, then learns to detect them in the altered image.

  • CIoU Loss: A bounding-box regression loss considering overlap, center distance, and aspect ratio—yielding faster convergence and better localization than MSE/IoU losses.
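
For reference, the three terms combine as \(\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v\), where \(\rho\) is the distance between box centers, \(c\) is the diagonal of the smallest enclosing box, and \(v\) measures aspect-ratio consistency. Below is a compact PyTorch sketch for corner-format boxes; production implementations typically also treat \(\alpha\) as a constant during backpropagation, which is skipped here.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss sketch for boxes in (x1, y1, x2, y2) format."""
    # intersection and union -> IoU
    ix1, iy1 = torch.max(pred[..., 0], target[..., 0]), torch.max(pred[..., 1], target[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], target[..., 2]), torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance, normalized by the squared enclosing-box diagonal
    cpx, cpy = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    ctx, cty = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    ex1, ey1 = torch.min(pred[..., 0], target[..., 0]), torch.min(pred[..., 1], target[..., 1])
    ex2, ey2 = torch.max(pred[..., 2], target[..., 2]), torch.max(pred[..., 3], target[..., 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # aspect-ratio consistency term
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```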

  • Cross mini-Batch Normalization (CmBN): Aggregates normalization statistics across all mini-batches in a batch, improving stability for small batch sizes.

Figure 4 illustrates the difference between BN, CBN, and the proposed CmBN. CmBN accumulates statistics across the mini-batches (t-3 to t) within one batch before updating the weights.

Bag of Specials (BoS)

Key techniques:

  • Mish Activation: Smooth, non-monotonic activation (\(y = x \cdot \tanh(\text{softplus}(x))\)) for better gradient flow and accuracy.
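
In code it is a one-liner; a quick sketch (recent PyTorch releases also ship this as nn.Mish):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # y = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))
```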

  • Modified SAM & PAN:

    • SAM is simplified from spatial attention (parallel max- and average-pooling paths) to point-wise attention computed by a single convolution.
    • PAN's shortcut connection is changed from addition to concatenation, preserving more feature detail.

Figure 5 shows the original SAM with parallel pooling paths and the modified, simpler version used in YOLOv4. Figure 6 shows the original PAN using addition to merge features, contrasted with the modified PAN which uses concatenation.
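
A minimal sketch of the modified, point-wise SAM described above: a single convolution produces a sigmoid mask that re-weights the feature map element-wise. Channel counts and placement in the network are illustrative.

```python
import torch
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """Modified SAM sketch: no pooling paths, just conv -> sigmoid -> element-wise scaling."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = torch.sigmoid(self.conv(x))  # one weight per spatial location and channel
        return x * attention
```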

  • DIoU-NMS: Enhances Non-Maximum Suppression by considering both IoU and center distance, improving detection for occluded objects.
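
To show how the center-distance term slots into standard greedy NMS, here is a NumPy sketch of DIoU-NMS for corner-format boxes; it illustrates the criterion, not the authors' exact implementation.

```python
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, threshold: float = 0.5):
    """Greedy NMS using DIoU (IoU minus normalized center distance) as the suppression score.
    boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,). Returns indices of kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]

        # plain IoU between the kept box and the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)

        # penalty: squared center distance over the squared enclosing-box diagonal
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        rho2 = ((ci - cr) ** 2).sum(axis=1)
        ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
        ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
        ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
        c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9

        diou = iou - rho2 / c2
        order = rest[diou < threshold]  # keep only candidates that are sufficiently distinct
    return keep
```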

Experiments: Validating Every Choice

Extensive ablation studies determined the impact of each technique on classification (ImageNet) and detection (MS COCO).

Classifier Training

The authors confirmed that CutMix, Mosaic, Label Smoothing, and Mish consistently improved backbone classification accuracy.

Figure 7 shows a visual gallery of different data augmentation techniques, including standard transformations, MixUp, CutMix, Mosaic, and blurring.

Table 2 shows the impact of various BoF and Mish on CSPResNeXt-50’s accuracy. The combination of CutMix, Mosaic, Label Smoothing, and Mish yields the highest Top-1 accuracy of 79.8%. Table 3 shows a similar experiment for the CSPDarknet-53 backbone, confirming that these techniques also boost its performance.

Detector Training

BoF Impact:
Mosaic augmentation and CIoU loss provided notable AP boosts.

Table 4 is an ablation study of Bag-of-Freebies. The final row shows that combining multiple BoF techniques like grid sensitivity elimination (S), Mosaic (M), optimal anchors (OA), and CIoU loss results in a high AP of 42.4%.

BoS Impact:
Adding the modified SAM block to PAN-SPP yielded the best performance.

Table 5 is an ablation study of Bag-of-Specials. It shows that adding a modified SAM block to the CSPResNeXt50-PANet-SPP architecture provides the best accuracy.

Final Backbone Choice & Batch Size Independence

Interestingly, the best classifier backbone wasn’t the best detector backbone. CSPDarknet53 with BoF + Mish led detection accuracy:

Table 6 compares detector performance using different pre-trained backbones. The CSPDarknet53 backbone trained with BoF and Mish achieves the highest AP of 43.0%.

BoF/BoS also made the network robust to batch size differences:

Table 7 shows that with BoF and BoS, the detector's performance is nearly identical for mini-batch sizes of 4 and 8, confirming that accuracy no longer depends on large mini-batches and that training on a single conventional GPU is practical.


Final Results: Dominating the Field

After careful tuning, YOLOv4 demonstrated exceptional speed-accuracy balance across GPUs:

Figure 8 presents a series of plots comparing YOLOv4’s speed and accuracy against dozens of other detectors on different GPU architectures (Maxwell, Pascal, Volta).

Results:

  • 43.5% AP (65.7% AP50) on MS COCO
  • ~65 FPS on Tesla V100
  • +10% AP and +12% FPS over YOLOv3

Table 8 shows YOLOv4’s performance on Maxwell GPUs, where it significantly outperforms other real-time detectors like YOLOv3 and SSD. Table 9 shows YOLOv4’s performance on Pascal GPUs, again demonstrating a superior speed-accuracy balance. Table 10 shows YOLOv4’s performance on Volta GPUs, where it runs at an impressive 62 FPS with 43.5% AP, faster and more accurate than competitors like EfficientDet at similar speeds.


Conclusion: A New Real-Time Baseline

YOLOv4 is more than an incremental update—it’s a shift in how top-tier models can be built. Instead of chasing single novel components, the authors showcased the strength of systematic integration and empirical validation, delivering a detector that is:

  • Fast & Accurate: New standard for real-time detection, pushing the Pareto frontier.
  • Accessible: Trainable on a single consumer GPU with 8–16GB VRAM.
  • Comprehensive: A proven playbook for integrating BoF and BoS into object detectors.

By blending the best ideas from the community with their own innovations, the YOLOv4 team didn’t just build a better detector—they crafted a better way to build detectors.