[You Only Look Once: Unified, Real-Time Object Detection 🔗](https://arxiv.org/abs/1506.02640)

YOLO: The Revolution That Made Computer Vision See in Real-Time

When you glance at a photograph, your brain performs a remarkable feat in milliseconds. You don’t just see a collection of pixels—you instantly identify objects, their locations, and their relationships. You notice that a person is walking a dog, a car is parked next to a fire hydrant, or a cat is sleeping on a sofa. For decades, teaching computers to do this with the same speed and accuracy remained a monumental challenge in computer vision. ...

2015-06

[YOLOv12: Attention-Centric Real-Time Object Detectors 🔗](https://arxiv.org/abs/2502.12524)

YOLOv12: The First Attention-Powered Real-Time Detector That Breaks the CNN Monopoly

For over a decade, the world of real-time object detection has been dominated by one family of models: YOLO (You Only Look Once). From self-driving cars to retail analytics, YOLO’s remarkable balance of speed and accuracy has made it the go-to solution for detecting objects in high-speed, practical applications. Progress within the YOLO ecosystem has been fueled by continual innovation—but nearly all architectural advances have revolved around Convolutional Neural Networks (CNNs). ...

2025-02

SmartBin: Teaching Trash Cans to Think with Deep Learning

Waste management is one of the most persistent challenges in modern urban life. As cities grow, so do our mountains of trash. The traditional way to deal with this—manual collection and sorting—is labor-intensive, costly, unhygienic, and often inefficient. What if we could bring intelligence to the very first step of waste management: the bin itself? A group of researchers set out to answer that question with SmartBin, a low-cost hardware solution that uses deep learning to automatically segregate garbage at the source. In their paper, “A deep learning approach based hardware solution to categorise garbage in environment”, they present a prototype that can distinguish between biodegradable and non-biodegradable waste, and physically sort it into separate compartments. ...
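
To make the idea concrete, here is a minimal, hypothetical sketch of the kind of inference loop such a bin could run, written in PyTorch. The MobileNetV2 backbone, the label names, the checkpoint path, and the `sort_item` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a SmartBin-style inference loop: a two-class image
# classifier whose prediction selects a compartment. Not the authors' code.
import torch
from PIL import Image
from torchvision import models, transforms

CLASSES = ["biodegradable", "non_biodegradable"]  # assumed label order

# A small pretrained backbone keeps inference cheap on embedded hardware.
model = models.mobilenet_v2(weights="IMAGENET1K_V1")
model.classifier[1] = torch.nn.Linear(model.last_channel, len(CLASSES))
# In practice, fine-tuned weights would be loaded here, e.g.:
# model.load_state_dict(torch.load("smartbin.pt"))  # hypothetical checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def sort_item(image_path: str) -> int:
    """Classify one camera frame and return the target compartment index."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        label = CLASSES[int(model(x).argmax(dim=1))]
    # A real bin would drive a servo here to divert the item.
    return 0 if label == "biodegradable" else 1
```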

Teaching Machines to Sort Our Trash: Deep Learning Meets Waste Management

Every year, the world generates over 2 billion tons of municipal solid waste, and this number is projected to soar to 3.4 billion tons by 2050. A vast amount of this waste ends up in landfills, polluting the soil, water, and air. Recycling offers a powerful solution, but success depends on one crucial and often overlooked step: proper waste segregation. Traditionally, sorting waste into categories like paper, plastic, metal, and glass has been a manual, labor-intensive process. It’s slow, costly, and can be hazardous for workers. But what if a machine could sort garbage with the speed and accuracy of a human—possibly even better? ...
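
The classification half of that question is within reach of standard tooling. Below is a hedged sketch of how one might fine-tune a pretrained CNN on a folder of labeled waste images; the dataset path, hyperparameters, and five-epoch loop are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch: fine-tuning a pretrained CNN on labeled waste images.
# Paths, classes, and hyperparameters are illustrative assumptions.
import torch
from torch import nn
from torchvision import datasets, models, transforms

# Assumed layout: data/train/<category>/*.jpg (paper, plastic, metal, ...)
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Start from ImageNet features and replace only the final layer.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
```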

ECO-HYBRID: Teaching AI to Sort Trash Better Than Ever

The world is facing a monumental waste problem. As cities expand and populations grow, so does the trash we produce. Projections suggest global waste could swell by 70%, reaching an astonishing 3.4 billion tons by 2050. At the heart of managing this crisis lies a deceptively simple task: sorting waste. Effective sorting is the first crucial step toward recycling, conserving resources, and sustaining a circular economy. But sorting isn’t easy. Traditional methods rely heavily on manual labor—slow, expensive, error-prone, and inadequate for the scale and complexity of today’s waste streams. This is where artificial intelligence (AI)—and specifically deep learning—has shown promise. Convolutional Neural Networks (CNNs) can classify waste images automatically, but they struggle with the messy reality of real-world trash: varied lighting conditions, visually similar materials, and imbalanced categories. ...
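
Before looking at ECO-HYBRID itself, it helps to see what mitigating those failure modes can look like in code. The sketch below shows two generic countermeasures, assuming PyTorch: photometric augmentation to simulate lighting variation, and inverse-frequency class weights to soften imbalance. The class counts and jitter strengths are made-up values, not the paper's recipe.

```python
# Illustrative countermeasures for the failure modes above; the jitter
# strengths and class counts are made-up values, not ECO-HYBRID's recipe.
import torch
from torch import nn
from torchvision import transforms

# Photometric jitter simulates the lighting variation of real-world trash photos.
augment = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Inverse-frequency class weights make rare categories count more in the loss.
class_counts = torch.tensor([1200.0, 300.0, 150.0, 900.0])  # assumed per-class counts
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```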

ImageNet: The Dataset That Taught Computers to See

In the late 2000s, the internet was overflowing with images. Flickr had billions of photos, Google was indexing countless more, and social media was just beginning its visual explosion. For computer vision researchers, this digital flood was both a tantalizing opportunity and a monumental challenge. On one hand, more data meant the potential to train more powerful and robust models. On the other, this data was a chaotic, unlabeled mess. How could you possibly teach a computer to recognize a Siberian husky when your training data was a jumble of random images tagged “dog”? ...

[Visualizing and Understanding Convolutional Networks 🔗](https://arxiv.org/abs/1311.2901)

Opening the Black Box: How CNNs Actually Learn to See

In 2012, a deep convolutional neural network (CNN) named AlexNet stunned the world by winning the ImageNet Large Scale Visual Recognition Challenge with an error rate almost half that of the runner-up. It was a watershed moment that kicked off the modern deep learning revolution. But while the results were undeniable, these networks were still black boxes—we knew they worked, but not what was happening inside their millions of parameters. ...
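
One simple way to start prying the box open is to look at intermediate activations directly. The sketch below, assuming PyTorch and torchvision, captures the first conv layer's feature maps with a forward hook and tiles them as images. It is a much cruder window than the deconvnet technique this paper introduces, but it conveys the spirit of visualizing what a layer responds to.

```python
# A crude window into the black box: capture an early conv layer's feature
# maps with a forward hook and tile them as images. This is a simpler
# stand-in for the paper's deconvnet visualizations, not its method.
import torch
import matplotlib.pyplot as plt
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1").eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.features[0].register_forward_hook(save_activation("conv1"))

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed photo
with torch.no_grad():
    model(x)

maps = activations["conv1"][0]  # 64 feature maps of shape 55x55
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for ax, fmap in zip(axes.flat, maps[:16]):
    ax.imshow(fmap.numpy(), cmap="gray")
    ax.axis("off")
fig.savefig("conv1_maps.png")
```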

2013-11

How Randomly Dropping Neurons Makes Neural Networks Smarter

If you’ve ever trained a large neural network, you’ve likely encountered its greatest nemesis: overfitting. You watch the training loss plummet, your model perfectly memorizing the training data, only to see its performance on unseen test data stagnate or even worsen. The model has learned the noise, not the signal. It’s like a student who memorizes the answers to a practice exam but fails the real test because they never learned the underlying concepts. ...
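
Dropout attacks this problem with an almost comically simple trick: during training, randomly silence neurons so the network cannot lean on any single unit. In a modern framework it is one line; the sketch below, assuming PyTorch, shows where the layer sits and how its behavior differs between training and evaluation modes.

```python
# A minimal sketch of dropout in PyTorch: during training, each hidden unit
# is zeroed with probability p, so the network cannot lean on any one neuron.
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly silence half the hidden units per forward pass
    nn.Linear(512, 10),
)

x = torch.randn(8, 784)
model.train()
y_train = model(x)  # stochastic: a fresh dropout mask on every call
model.eval()
y_test = model(x)   # deterministic: dropout is the identity at eval time
                    # (PyTorch rescales activations during training instead)
```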

LeNet and the Birth of Modern Computer Vision: How 1998's Breakthrough Made CNNs Learn from Pixels

In 1998 a group from AT&T Labs and the Université de Montréal published a paper that became a landmark in machine learning and computer vision: “Gradient-Based Learning Applied to Document Recognition” (LeCun, Bottou, Bengio, Haffner). The paper did two things that matter even more today than they did then. First, it showed how convolutional neural networks (CNNs) can learn relevant visual features directly from pixels, removing the need for brittle, hand-crafted feature extractors. Second, it introduced a principled way to build and train complex, multi-module systems end-to-end by representing intermediate hypotheses as graphs and backpropagating through them, a scheme the authors call Graph Transformer Networks (GTNs). This article walks through the core ideas, the architectures, the training insights, and the practical systems in the paper. I’ll explain the concepts in a way that aims to be useful for both students and practitioners: not just what the authors did, but why it mattered and what to take away for modern work. ...
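
For readers who want the architecture in runnable form, here is a LeNet-5-style network sketched in modern PyTorch. The layer sizes follow the classic design for 32×32 grayscale inputs, but ReLU and max pooling stand in for the original tanh units and subsampling layers, so treat it as an approximation rather than a faithful reproduction.

```python
# LeNet-5-style CNN in modern PyTorch. Layer sizes follow the 1998 design;
# ReLU and max pooling approximate the original tanh/subsampling layers.
import torch
from torch import nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(1, 1, 32, 32))  # -> shape (1, 10)
```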

[YOLOv4: Optimal Speed and Accuracy of Object Detection 🔗](https://arxiv.org/abs/2004.10934)

YOLOv4: Real-Time Object Detection That Breaks the Speed-Accuracy Trade-Off

In the world of computer vision, object detection is a foundational task with applications ranging from autonomous driving to medical imaging. A persistent challenge in this field has been the trade-off between speed and accuracy. Highly precise models are often too slow for real-time scenarios, while faster models sometimes lack the accuracy needed for mission-critical applications. But what if you could have both? What if you could train a state-of-the-art detector not on massive, expensive cloud infrastructure, but on a single consumer-grade GPU sitting in your lab or home office? ...

2020-04