In the world of computer vision, few algorithms have made an impact as significant and lasting as YOLO (You Only Look Once). From enabling self-driving cars to perceive the world around them to powering automated checkout systems, real-time object detection has become a cornerstone of modern AI. At the heart of this revolution is YOLO—a family of models celebrated for their incredible balance of speed and accuracy.
Since its debut in 2015, YOLO has undergone an extraordinary evolution. Each new version has pushed the boundaries of what’s possible, introducing clever architectural changes and novel training techniques. This article will take you on a comprehensive journey through the entire history of YOLO, from the groundbreaking original all the way to the latest state-of-the-art versions like YOLOv8 and the AI-designed YOLO-NAS.
Whether you’re a student just starting out in deep learning or a practitioner looking to understand the mechanics behind these powerful models, this guide will break down the core concepts, key innovations, and the story of how YOLO became a titan in computer vision.
Figure 1: Timeline of major YOLO versions from 2015 to 2023.
First, Some Ground Rules: How We Measure Success
Before we dive into the first YOLO model, it’s important to understand how object detectors are evaluated. Without a solid grasp of these metrics, the improvements from one version to the next won’t make much sense.
Intersection over Union (IoU)
The most fundamental concept is Intersection over Union (IoU). Imagine our model predicts a bounding box for a cat, and we have the ground truth box from our dataset. How do we know if the prediction is any good? We measure the overlap.
IoU is the ratio of the area of overlap between the predicted box and the ground truth box to the total area covered by both (their union). The value ranges from 0 (no overlap) to 1 (perfect overlap). Typically, a prediction is considered a true positive if its IoU with a ground truth box is above a certain threshold, often 0.5.
Figure 2: IoU visualized with examples of poor, good, and excellent overlap.
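To make this concrete, here is a minimal IoU function for axis-aligned boxes in (x1, y1, x2, y2) corner format; it is a sketch only, since real detection code typically vectorizes this with NumPy or PyTorch:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction that partially overlaps the ground truth
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```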
Average Precision (AP)
The primary metric for object detection is Average Precision (AP); averaging AP over all object classes gives the mean Average Precision (mAP), and the two terms are often used interchangeably. AP summarizes a model’s performance across all confidence thresholds in a single number, calculated from the precision–recall curve:
- Precision: Of all the objects we predicted, how many were correct?
- Recall: Of all the actual objects in the image, how many did we find?
There’s a natural trade-off: if you try to find every single object (high recall), you’ll likely make more mistakes (lower precision). AP elegantly summarizes this balance. The MS COCO dataset, the modern standard for benchmarking, calculates AP by averaging over ten IoU thresholds (0.5 to 0.95 in steps of 0.05), making it a particularly rigorous metric.
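As a rough sketch of how precision, recall, and AP fit together for a single class, assuming detections have already been matched to ground truth and sorted by confidence (COCO additionally averages this over IoU thresholds and over all classes):

```python
import numpy as np

def average_precision(tp_flags, num_ground_truths):
    """AP for one class. tp_flags[i] is 1 if the i-th highest-confidence
    detection matched an unclaimed ground-truth box, else 0."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / num_ground_truths        # share of real objects found so far
    precision = tp / (tp + fp)             # share of predictions that were correct

    # Area under the precision-recall curve (all-point interpolation)
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# Five detections, four real objects; detections 2 and 5 are false positives
print(average_precision([1, 0, 1, 1, 0], num_ground_truths=4))  # 0.625
```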
Non-Maximum Suppression (NMS)
Object detectors often output multiple bounding boxes for the same object. To clean this up, we use Non-Maximum Suppression (NMS):
- Keep the box with the highest confidence score.
- Remove any remaining boxes that have a high IoU with it, since they likely cover the same object.
- Repeat with the boxes that are left until none remain.
This ensures we get one clean, confident prediction per object.
Figure 3: NMS eliminates redundant overlapping boxes, leaving only the most confident predictions.
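A minimal greedy implementation of these steps, reusing the iou helper sketched earlier (real pipelines run a vectorized version, usually per class):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: list of (x1, y1, x2, y2); scores: confidences.
    Returns the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)              # highest-confidence box still in play
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```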
YOLOv1: The Revolution of a Single Glance
First released as a preprint in 2015 and published at CVPR 2016, the original YOLO was a radical departure from the object detection methods of its time. Previous state-of-the-art models like R-CNN (Region-based Convolutional Neural Networks) used a two-stage approach: first propose regions that might contain objects, then run a classifier on each region. This was accurate but slow.
YOLO’s creators simplified this dramatically: treat object detection as a single regression problem. The network processes the entire image in a single pass, predicting all bounding boxes and class probabilities at once.
How It Works
- Grid System: YOLOv1 divides the input image into an \( S \times S \) grid (here \( S = 7 \)).
- Cell Responsibility: Each grid cell detects objects whose center falls inside it.
- Predictions per Cell: Each cell predicts:
- \( B \) bounding boxes (here \( B = 2 \)), each with \( x, y, w, h \) and a confidence score \( P_c \).
- \( C \) class probabilities for the object’s category.
Final output: \( S \times S \times (B \times 5 + C) \). On PASCAL VOC with \( C = 20 \), this was a \( 7 \times 7 \times 30 \) tensor. This unification made YOLO blazing fast.
Figure 4: Simplified YOLOv1 output structure.
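A toy illustration of how that \( 7 \times 7 \times 30 \) tensor is laid out, with random values standing in for real network output (the exact channel ordering is an implementation detail; here we assume the two boxes come first, followed by the 20 class probabilities):

```python
import numpy as np

S, B, C = 7, 2, 20                             # grid size, boxes per cell, VOC classes
prediction = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output
print(prediction.shape)                        # (7, 7, 30)

# For one grid cell, the 30 values split into two boxes and 20 class scores
cell = prediction[3, 4]
box_1 = cell[0:5]                              # x, y, w, h, confidence
box_2 = cell[5:10]
class_probs = cell[10:]                        # P(class | object) for 20 VOC classes
best_class = int(np.argmax(class_probs))
```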
Architecture and Loss Function
YOLOv1’s architecture: 24 convolutional layers followed by two fully connected layers, inspired by GoogLeNet’s use of \( 1\times1 \) convolutions to reduce parameters.
Figure 5: YOLOv1 architecture — convolutional backbone and fully connected detection head.
The loss function was sum-squared error with three parts:
- Localization Loss: Penalizes errors in box coordinates.
- Confidence Loss: Penalizes incorrect objectness scores, weighted less for boxes without objects.
- Classification Loss: Penalizes incorrect class probability predictions.
Figure 6: YOLOv1 loss combining localization, confidence, and classification components.
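For reference, the full loss from the YOLOv1 paper, where \( \mathbb{1}_{ij}^{\text{obj}} \) indicates that box \( j \) in cell \( i \) is responsible for an object, \( \lambda_{\text{coord}} = 5 \) and \( \lambda_{\text{noobj}} = 0.5 \); the square roots on width and height soften the penalty gap between large and small boxes:

\[
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]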
Strengths and Limitations
YOLOv1 was revolutionary in speed, running at 45 frames per second (with the smaller Fast YOLO variant reaching 155 fps) and making real-time detection practical. Its limitations included difficulty with small objects that appear in groups, only one class prediction per grid cell, and less accurate localization than two-stage detectors.
YOLOv2: Better, Faster, Stronger
In 2017, YOLOv2 addressed v1’s weaknesses while preserving speed.
Key improvements:
- Batch Normalization on all conv layers.
- High-Resolution Classifier pre-training at higher resolution.
- Anchor Boxes: Predict offsets from predefined shapes.
Figure 7: Multiple anchor boxes per grid cell.
- Dimension Clusters: K-means on training boxes to choose anchor shapes.
- Direct Location Prediction: Box centers constrained by a sigmoid to stay within their grid cell, which stabilizes training (see the decoding sketch after this list).
Figure 8: YOLOv2 bounding box prediction from priors.
- Finer-Grained Features: Passthrough layer merging detailed early features with deep layers.
- Multi-Scale Training: Randomly varying input size during training.
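Putting anchor boxes and direct location prediction together, YOLOv2 decodes each raw prediction \( (t_x, t_y, t_w, t_h) \) relative to its grid cell offset \( (c_x, c_y) \) and anchor prior \( (p_w, p_h) \). A minimal sketch in grid-cell units:

```python
import math

def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2-style decoding of one raw prediction for one anchor box.
    (cx, cy) is the grid cell's top-left corner and (pw, ph) is the anchor
    (prior) size, all expressed in grid units."""
    bx = cx + _sigmoid(tx)           # sigmoid keeps the center inside its cell
    by = cy + _sigmoid(ty)
    bw = pw * math.exp(tw)           # anchor width scaled exponentially
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```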
Backbone: Darknet-19 — efficient 19-layer network.
Figure 9: YOLOv2 architecture with Darknet-19 backbone.
YOLO9000: Joint training with COCO (detection) + ImageNet (classification) enabled detection of over 9,000 categories.
Result: 78.6% mAP on PASCAL VOC 2007, up from YOLOv1’s 63.4%.
YOLOv3: The Multi-Scale Revolution
Released in 2018, YOLOv3 improved accuracy and small-object detection.
Innovations:
- Deeper Backbone: Darknet-53 with residual connections.
Figure 10: YOLOv3’s 53-layer Darknet with residual blocks.
- Predictions at Three Scales: Using FPN principles — large objects on coarse grids, small on fine grids.
Figure 11: YOLOv3 multi-scale outputs for small, medium, and large objects.
- Binary Cross-Entropy Class Prediction: Independent logistic classifiers for multi-label capability.
YOLOv3 delivered accuracy comparable to detectors such as SSD at roughly three times the speed, keeping YOLO firmly in the real-time regime.
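To see what “three scales” means in practice, here is a sketch of the detection head shapes for a standard \( 416 \times 416 \) COCO configuration (3 anchors per scale, 80 classes, so \( 3 \times (5 + 80) = 255 \) output channels per grid cell):

```python
num_anchors_per_scale = 3
num_classes = 80                                       # MS COCO
channels = num_anchors_per_scale * (5 + num_classes)   # 3 * 85 = 255

# For a 416x416 input, the three heads predict at strides 32, 16, and 8
for stride in (32, 16, 8):
    grid = 416 // stride
    print(f"stride {stride:>2}: {grid}x{grid}x{channels}")
# stride 32: 13x13x255  (coarse grid -> large objects)
# stride 16: 26x26x255
# stride  8: 52x52x255  (fine grid -> small objects)
```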
Backbone, Neck, and Head
Modern detectors are described by:
- Backbone: CNN feature extractor (e.g., Darknet, ResNet).
- Neck: Feature aggregation/refinement layer (e.g., FPN, PANet).
- Head: Prediction layer for boxes and classes.
Figure 12: The three stages of an object detector.
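Schematically, the three stages compose like this (a sketch in PyTorch-style pseudocode; the concrete backbone, neck, and head modules differ between YOLO versions):

```python
import torch.nn as nn

class Detector(nn.Module):
    """Schematic composition only; each stage is supplied as its own module."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # e.g. Darknet-53: image -> multi-scale features
        self.neck = neck           # e.g. FPN / PANet: fuse features across scales
        self.head = head           # predicts boxes, objectness, and classes

    def forward(self, images):
        features = self.backbone(images)
        fused = self.neck(features)
        return self.head(fused)
```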
YOLOv4: The Bag of Tricks
Released in 2020 by a new team of authors (Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao), YOLOv4 systematically tested dozens of training and architectural techniques, grouping them into two categories:
- Bag-of-Freebies (BoF): Improve training at no inference cost; e.g., Mosaic data augmentation, CutMix.
- Bag-of-Specials (BoS): Add a small inference cost for a meaningful accuracy gain; e.g., Mish activation, Spatial Pyramid Pooling (SPP).
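As a small taste of a Bag-of-Specials item, the Mish activation used in YOLOv4’s backbone is just a smooth, self-gated function (a sketch; recent PyTorch versions also ship it as torch.nn.Mish):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish activation: x * tanh(softplus(x)), smooth and non-monotonic."""
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-2.0, 0.0, 2.0])))   # roughly [-0.25, 0.00, 1.94]
```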
Architecture: CSPDarknet53-PANet-SPP — CSP backbone, PANet neck, SPP block.
Figure 13: YOLOv4’s CSPDarknet53 backbone, SPP + PANet neck, and YOLOv3-style head.
Figure 14: YOLOv4 training techniques.
YOLOv5: PyTorch Era and Scalability
Released in 2020 by Ultralytics and developed natively in PyTorch, YOLOv5 is easy to train, export, and deploy.
- Architecture: Modified CSPDarknet53 backbone, CSP-PAN neck, SPPF block.
- Key Feature: AutoAnchor for dataset-specific anchors.
- Scalability: A family of models from nano (YOLOv5n) to extra-large (YOLOv5x), trading speed for accuracy.
Figure 15: YOLOv5 backbone–neck–head design with SPPF.
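Part of YOLOv5’s popularity is how little code inference takes. A typical sketch via torch.hub, assuming the interface published in the ultralytics/yolov5 repository (the image path here is a placeholder):

```python
import torch

# Load a pretrained small model from the Ultralytics YOLOv5 repository
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run inference on an image path (or URL) and inspect the detections
results = model("path/to/image.jpg")
results.print()                   # per-class summary of the detections
detections = results.xyxy[0]      # tensor of (x1, y1, x2, y2, conf, class)
```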
YOLOX, YOLOv6, YOLOv7
YOLOX (2021)
An anchor-free design with a decoupled head for classification and regression, plus the SimOTA label-assignment strategy.
Figure 16: YOLOX decoupled head design.
YOLOv6 (2022)
A hardware-friendly, RepVGG-style backbone and quantization techniques aimed at efficient industrial deployment.
Figure 17: YOLOv6 with RepVGG backbone and PAN neck.
YOLOv7 (2022)
An Extended Efficient Layer Aggregation Network (E-ELAN) backbone and a “trainable bag-of-freebies” that raises accuracy without increasing inference cost.
Figure 18: YOLOv7 with E-ELAN blocks.
YOLOv8: Anchor-Free, Decoupled, and Versatile
Released in 2023 by Ultralytics:
- Anchor-free like YOLOX.
- New C2f backbone module.
- Decoupled head.
- Supports detection, segmentation, and classification in a single framework (see the usage sketch below).
Figure 19: YOLOv8’s modified backbone and decoupled head.
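A brief usage sketch with the Ultralytics Python package (installable via pip install ultralytics), assuming its published YOLO interface; the weight and image file names are placeholders:

```python
from ultralytics import YOLO

# One class covers the detection, segmentation, and classification checkpoints
model = YOLO("yolov8n.pt")                 # nano detection model
results = model("path/to/image.jpg")       # run inference

for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)     # coordinates, confidence, class id
```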
Specialized and AI-Designed YOLOs
PP-YOLO Series
Developed by Baidu within the PaddlePaddle framework, the series made incremental improvements on YOLOv3, culminating in the anchor-free PP-YOLOE design.
Figure 20: PP-YOLOE with CSPRepResNet backbone.
YOLO-NAS (2023)
Generated with the AutoNAC neural architecture search engine, YOLO-NAS includes quantization-aware modules so that accuracy degrades little after INT8 quantization.
Figure 21: YOLO-NAS automatically searched architecture.
YOLO with Transformers
Hybrid architectures combining CNN efficiency with Transformer global context.
Figure 22: ViT-YOLO combining MHSA with CSPDark.
Discussion: Patterns of Evolution
Figure 23: YOLO evolution — versions, framework, anchors, backbone, AP.
Trends:
- Anchors: Absent in v1, adopted in v2 and kept through v5 and v7, then dropped again in the anchor-free designs from YOLOX onwards (including YOLOv6 and YOLOv8).
- Frameworks: Migration from Darknet to PyTorch unleashed rapid innovation.
- Architectures: From simple conv stacks to AI-searched hybrid designs.
Trade-off: Speed vs Accuracy
Scaling model size (nano to xlarge) allows explicit choice of speed vs accuracy.
Figure 24: YOLOv8 offers the best accuracy–latency trade-off among the versions compared.
The Future of YOLO
Expect:
- New Techniques: Incorporating breakthroughs in neural architecture and training.
- Beyond Detection: Stronger multi-task support for segmentation, tracking, pose estimation, multimodal.
- Hardware Adaptation: Co-design with accelerators, quantization, NAS for efficiency.
From an elegant idea in 2015 to today’s versatile ecosystem, YOLO’s journey is a testament to innovation and community collaboration. Its story is still being written.