In the world of computer vision, object detection is a foundational task with applications ranging from autonomous driving to medical imaging. A persistent challenge in this field has been the trade-off between speed and accuracy. Highly precise models are often too slow for real-time scenarios, while faster models sometimes lack the accuracy needed for mission-critical applications.
But what if you could have both? What if you could train a state-of-the-art detector not on massive, expensive cloud infrastructure, but on a single consumer-grade GPU sitting in your lab or home office?
This is the promise of YOLOv4: Optimal Speed and Accuracy of Object Detection. Published in 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, this work isn’t built on a single groundbreaking idea. Instead, it’s a masterclass in engineering and empirical science, where the authors meticulously evaluated and combined dozens of advances in deep learning to create a detector that is both highly accurate and blazing fast—democratizing high-performance object detection.
As the performance chart from the paper shows, YOLOv4 carves out a leading position, outperforming its predecessor YOLOv3 and matching or surpassing competitive models like EfficientDet, while maintaining real-time speeds.
In this post, we’ll explore the architecture, training strategies, and experiments that make YOLOv4 a landmark in object detection.
Background: The Anatomy of an Object Detector
Before dissecting YOLOv4, let’s establish a common understanding. Modern object detectors, whether one-stage or two-stage, share a similar structure:
- Input: The image fed into the network. It can be a single image, patches for more localized features, or multi-scale images via an image pyramid.
- Backbone: A deep CNN, often pre-trained on ImageNet, that extracts rich feature maps from the input. Examples: ResNet, VGG, DenseNet, or Darknet in YOLO.
- Neck: Aggregates feature maps from different backbone layers, producing robust features for detecting objects at multiple scales. Popular methods include Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN).
- Head: The prediction layer. It outputs classes and bounding boxes from the aggregated features (a minimal code sketch of this backbone → neck → head pipeline follows the list below).
- One-Stage Detectors (Dense Prediction): Predict objects directly from feature maps in a single step (e.g., YOLO, SSD)—fast but sometimes less accurate.
- Two-Stage Detectors (Sparse Prediction): First propose candidate regions, then classify/detect within them (e.g., Faster R-CNN)—more accurate but slower.
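To make this anatomy concrete, here is a minimal, hypothetical PyTorch sketch of the backbone → neck → head pipeline. The layers and sizes are placeholders for illustration, not YOLOv4's actual architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch of the backbone -> neck -> head pipeline described above.
# All modules here are illustrative stand-ins, not real CSPDarknet53/PAN/YOLO layers.
class TinyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        # Backbone: extracts feature maps from the input image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: refines/aggregates backbone features (stand-in for SPP/PAN).
        self.neck = nn.Conv2d(64, 128, 1)
        # Head: dense prediction of (x, y, w, h, objectness, classes) per anchor.
        self.head = nn.Conv2d(128, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        features = self.backbone(x)
        features = self.neck(features)
        return self.head(features)

preds = TinyDetector()(torch.randn(1, 3, 416, 416))
print(preds.shape)  # (1, 255, 104, 104) for 80 classes and 3 anchors
```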
YOLOv4 is a one-stage detector—key to its real-time performance. The main goal of the authors was to optimize its architecture for parallel computation on standard GPUs, ensuring practicality and accessibility.
The Core Philosophy: Bag of Freebies & Bag of Specials
One of YOLOv4’s ingenious approaches is to categorize potential improvements:
Bag of Freebies (BoF): Techniques that improve accuracy without increasing inference cost. Typically applied during training. Examples: data augmentation, better loss functions.
Bag of Specials (BoS): Modules or post-processing steps that add minimal inference cost but boost accuracy. Examples: attention mechanisms, receptive field enhancements.
The paper methodically explores an array of BoF and BoS techniques, identifying the most effective mix for a superior detector.
Crafting the YOLOv4 Architecture
With this philosophy in mind, the authors carefully chose components for each part of YOLOv4.
Backbone: CSPDarknet53
A strong backbone for detection needs:
- Large receptive field to capture object context.
- High model capacity to represent complex features.
Comparisons led to CSPDarknet53 as the optimal choice. While CSPResNeXt50 is slightly superior for image classification, CSPDarknet53—with 29 convolutional layers and a larger receptive field—proved better for object detection. CSP (Cross-Stage-Partial) connections split a stage's feature map into two parts, route only one part through the stage's convolutional blocks, and merge the two at the end, which improves gradient flow and efficiency.
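As a rough illustration of the CSP idea (not the exact CSPDarknet53 layout), the sketch below splits a stage's channels in two, sends only one half through the convolutional blocks, and concatenates the halves before a transition layer. Layer counts and sizes are assumptions for demonstration:

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Illustrative Cross-Stage-Partial block: half the channels take the
    compute-heavy path, the other half bypass it, and the two are merged."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        # Compute-heavy path: a small stack of convolutional blocks.
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 1), nn.ReLU(),
                nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
            )
            for _ in range(num_blocks)
        ])
        # Transition applied after concatenating the two partial paths.
        self.transition = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)   # split channels in two
        part2 = self.blocks(part2)                # only half goes through the blocks
        return self.transition(torch.cat([part1, part2], dim=1))

out = CSPBlock(64)(torch.randn(1, 64, 52, 52))
print(out.shape)  # (1, 64, 52, 52)
```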
Neck: SPP + PAN
Two modules enhance backbone features:
- SPP (Spatial Pyramid Pooling): Pools features at multiple scales and concatenates them, greatly increasing the receptive field without reducing speed (see the sketch after this list).
- PAN (Path Aggregation Network): Adds a bottom-up path on top of the usual top-down one, blending high-level semantic and low-level localization features. YOLOv4 modifies PAN to further improve multi-scale fusion.
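Here is a minimal sketch of the SPP idea as used in YOLO-style detectors: the same feature map is max-pooled with several kernel sizes at stride 1 (5, 9, and 13 are the commonly cited YOLOv4 values) and the results are concatenated with the input, so spatial resolution is preserved while the receptive field grows. Channel counts are illustrative:

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Spatial Pyramid Pooling sketch: max-pool the same feature map with
    several kernel sizes (stride 1, so resolution is preserved) and
    concatenate the results with the input."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Input channels C become C * (1 + len(kernel_sizes)) after concatenation.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

out = SPPBlock()(torch.randn(1, 512, 13, 13))
print(out.shape)  # (1, 2048, 13, 13)
```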
Head: YOLOv3-style Anchor-based Prediction
For the head, YOLOv4 adopts the proven YOLOv3 anchor-based prediction architecture, maintaining its strengths in fast, reliable bounding-box estimation.
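For intuition, here is a hedged sketch of how a YOLOv3-style head's raw outputs are decoded into boxes: sigmoid offsets place the center inside the responsible grid cell, and anchor dimensions scaled by an exponential give width and height. The tensor layout and example anchor values are assumptions for illustration:

```python
import torch

def decode_yolo_outputs(raw, anchors, stride):
    """Sketch of YOLOv3-style anchor-based box decoding.
    raw: (batch, num_anchors, grid_h, grid_w, 5 + num_classes) network output.
    anchors: (num_anchors, 2) anchor widths/heights in pixels.
    stride: how many input pixels one grid cell covers.
    """
    grid_h, grid_w = raw.shape[2], raw.shape[3]
    cy, cx = torch.meshgrid(
        torch.arange(grid_h), torch.arange(grid_w), indexing="ij"
    )
    # Center: sigmoid offsets inside the responsible grid cell, scaled by stride.
    bx = (torch.sigmoid(raw[..., 0]) + cx) * stride
    by = (torch.sigmoid(raw[..., 1]) + cy) * stride
    # Size: anchor dimensions scaled by an exponential of the raw outputs.
    bw = anchors[:, 0].view(1, -1, 1, 1) * torch.exp(raw[..., 2])
    bh = anchors[:, 1].view(1, -1, 1, 1) * torch.exp(raw[..., 3])
    objectness = torch.sigmoid(raw[..., 4])
    class_probs = torch.sigmoid(raw[..., 5:])
    return bx, by, bw, bh, objectness, class_probs

# Example: one 13x13 scale with 3 anchors and 80 classes at stride 32.
raw = torch.randn(1, 3, 13, 13, 85)
anchors = torch.tensor([[116., 90.], [156., 198.], [373., 326.]])
boxes = decode_yolo_outputs(raw, anchors, stride=32)
```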
The Winning Combination: YOLOv4’s BoF & BoS
Bag of Freebies (BoF)
Key techniques:
- Mosaic Data Augmentation: New in YOLOv4—combines four training images into one, placing objects in varied contexts. This increases object diversity per batch, reduces dependency on large batch sizes, and boosts robustness.
- Self-Adversarial Training (SAT): A two-stage approach in which the network first “attacks” the image to hide objects from itself, then learns to detect them in the altered image.
- CIoU Loss: A bounding-box regression loss considering overlap, center distance, and aspect ratio—yielding faster convergence and better localization than MSE/IoU losses (a small sketch follows this list).
- Cross mini-Batch Normalization (CmBN): Aggregates normalization statistics across the mini-batches within a single batch, improving stability for small batch sizes.
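To see what CIoU adds over plain IoU, here is a minimal sketch of the loss for axis-aligned boxes in (x1, y1, x2, y2) format; the reduction and box format used in a real training loop are assumptions:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU sketch: 1 - IoU + (center distance)^2 / (enclosing diagonal)^2 + alpha * v,
    where v measures aspect-ratio inconsistency."""
    # Intersection and union -> IoU.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers.
    cxp = (pred[..., 0] + pred[..., 2]) / 2
    cyp = (pred[..., 1] + pred[..., 3]) / 2
    cxt = (target[..., 0] + target[..., 2]) / 2
    cyt = (target[..., 1] + target[..., 3]) / 2
    center_dist = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # Squared diagonal of the smallest enclosing box.
    ex1 = torch.min(pred[..., 0], target[..., 0])
    ey1 = torch.min(pred[..., 1], target[..., 1])
    ex2 = torch.max(pred[..., 2], target[..., 2])
    ey2 = torch.max(pred[..., 3], target[..., 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term and its weight.
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_dist / diag + alpha * v
```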
Bag of Specials (BoS)
Key techniques:
- Mish Activation: A smooth, non-monotonic activation, \(y = x \cdot \tanh(\text{softplus}(x))\), for better gradient flow and accuracy (a small sketch of Mish and DIoU-NMS follows this list).
- Modified SAM & PAN:
  - SAM simplified from dual pooling paths to a single convolution.
  - PAN changed from addition to concatenation to preserve more feature detail.
- DIoU-NMS: Enhances Non-Maximum Suppression by considering both IoU and center distance, improving detection for occluded objects.
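Below is a small, illustrative sketch of two of these BoS items: Mish exactly as defined above, and a greedy NMS loop whose suppression test uses DIoU (IoU minus a normalized center-distance penalty), so overlapping boxes with well-separated centers survive. The threshold and box format are assumptions:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS sketch using DIoU = IoU - d^2/c^2 as the suppression criterion.
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        b, r = boxes[best], boxes[rest]
        # Plain IoU between the winner and the remaining boxes.
        x1 = torch.max(b[0], r[:, 0]); y1 = torch.max(b[1], r[:, 1])
        x2 = torch.min(b[2], r[:, 2]); y2 = torch.min(b[3], r[:, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        area_r = (r[:, 2] - r[:, 0]) * (r[:, 3] - r[:, 1])
        iou = inter / (area_b + area_r - inter + 1e-7)
        # Center-distance penalty normalized by the enclosing-box diagonal.
        d2 = ((b[0] + b[2]) / 2 - (r[:, 0] + r[:, 2]) / 2) ** 2 + \
             ((b[1] + b[3]) / 2 - (r[:, 1] + r[:, 3]) / 2) ** 2
        c2 = (torch.max(b[2], r[:, 2]) - torch.min(b[0], r[:, 0])) ** 2 + \
             (torch.max(b[3], r[:, 3]) - torch.min(b[1], r[:, 1])) ** 2 + 1e-7
        diou = iou - d2 / c2
        order = rest[diou <= threshold]   # keep only boxes not suppressed
    return keep
```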
Experiments: Validating Every Choice
Extensive ablation studies determined the impact of each technique on classification (ImageNet) and detection (MS COCO).
Classifier Training
The authors confirmed that CutMix, Mosaic, Label Smoothing, and Mish consistently improved backbone classification accuracy.
Detector Training
- BoF impact: Mosaic augmentation and CIoU loss provided notable AP boosts.
- BoS impact: Adding the modified SAM block to PAN-SPP yielded the best performance.
Final Backbone Choice & Batch Size Independence
Interestingly, the best classifier backbone wasn’t the best detector backbone: CSPDarknet53 with BoF and Mish delivered the highest detection accuracy.
Adding BoF and BoS also made detector performance largely insensitive to mini-batch size, which is what makes training with small batches on a single GPU practical.
Final Results: Dominating the Field
After careful tuning, YOLOv4 demonstrated an exceptional speed-accuracy balance across a range of GPU architectures. Key results:
- 43.5% AP (65.7% AP50) on MS COCO
- ~65 FPS on Tesla V100
- A relative improvement of roughly 10% in AP and 12% in FPS over YOLOv3
Conclusion: A New Real-Time Baseline
YOLOv4 is more than an incremental update—it’s a shift in how top-tier models can be built. Instead of chasing single novel components, the authors showcased the strength of systematic integration and empirical validation, delivering a detector that is:
- Fast & Accurate: New standard for real-time detection, pushing the Pareto frontier.
- Accessible: Trainable on a single consumer GPU with 8–16GB VRAM.
- Comprehensive: A proven playbook for integrating BoF and BoS into object detectors.
By blending the best ideas from the community with their own innovations, the YOLOv4 team didn’t just build a better detector—they crafted a better way to build detectors.