[YOLOv3: An Incremental Improvement 🔗](https://arxiv.org/abs/1804.02767)

YOLOv3: Engineering Excellence Through Incremental Improvements

In the world of computer vision, the You Only Look Once (YOLO) family of models is legendary. Known for its blistering speed, YOLO redefined real-time object detection by framing it as a single regression problem. But the journey didn’t stop at one great idea. Following the success of YOLO and YOLOv2, the creators returned with a new iteration: YOLOv3. The 2018 paper YOLOv3: An Incremental Improvement is famous not just for its technical contributions but for its refreshingly candid and humorous tone. The authors don’t claim a massive breakthrough—instead, they present a series of thoughtful, practical updates that collectively create a significantly better detector. It’s a masterclass in engineering and the power of incremental progress. ...
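The "single regression problem" framing mentioned above comes from the original YOLO (covered further down this list) and carries through to v3: one forward pass maps an image straight to a grid of box and class predictions. The toy PyTorch sketch below is a rough illustration only; the grid size S=7, B=2 boxes, and C=20 classes match the original YOLO paper, but the tiny backbone is a placeholder of our own, not any actual YOLO architecture.

```python
import torch
import torch.nn as nn

# Detection as a single regression: one forward pass produces an S x S grid where
# each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20  # values from the original YOLO paper; the network below is a stand-in

backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
    nn.AdaptiveAvgPool2d((S, S)),
    nn.Conv2d(16, B * 5 + C, 1),   # regress every box and class score in one shot
)

image = torch.randn(1, 3, 448, 448)   # 448x448 input, as in YOLO v1
predictions = backbone(image)         # shape: (1, B*5 + C, S, S)
print(predictions.shape)              # torch.Size([1, 30, 7, 7])
```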

2018-04
[YOLO9000: Better, Faster, Stronger 🔗](https://arxiv.org/abs/1612.08242)

YOLO9000: The Real-Time Detector That Recognizes 9,000 Object Categories

Object detection has long been a cornerstone task in computer vision. We need models that can not only tell us what is in an image, but also where it is. For years, progress meant a trade-off: you could choose a model that was highly accurate or one fast enough for real-time applications—but rarely both. And even the best detectors were limited to a small vocabulary, trained on datasets with a few dozen or at most a few hundred object categories. ...

2016-12
[You Only Look Once: Unified, Real-Time Object Detection 🔗](https://arxiv.org/abs/1506.02640)

YOLO: The Model That Changed Object Detection with a Single Glance

When you glance at a picture, you instantly recognize the objects within it. You can tell a dog from a bicycle, identify multiple people, and understand where they are in the scene. This ability is effortless for humans, but for computers, it has historically been a monumental challenge. The task, known as object detection, is a cornerstone of computer vision, unlocking capabilities ranging from self-driving cars to assistive technologies and robotics. ...

2015-06
[Opt-In Art: Learning Art Styles from Just a Few Examples 🔗](https://arxiv.org/abs/2412.00176)

Teaching AI to Paint Without Ever Seeing a Painting

The rise of generative AI has been nothing short of explosive. Models like Stable Diffusion, Midjourney, and DALL·E can conjure breathtaking images from simple text prompts, democratizing artistic creation in ways we’ve never seen before. But this revolution comes with a controversial side: these powerful models are often trained on vast internet-sourced datasets without the explicit consent of original artists. This practice has sparked fierce debates about copyright, ownership, and the nature of creativity. ...

2024-12
[CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Models 🔗](https://arxiv.org/abs/2504.08259)

CoProSketch: The AI Sketch Generator That Actually Lets You Edit

Sketches are the soul of visual art. Before an artist commits to a fully rendered painting, they start with a line drawing—a blueprint that captures the essential structure, layout, and proportions of the final piece. This process is intuitive and powerful because editing a sketch is far easier than making pixel-perfect adjustments to a finished color image. Despite the recent explosion in generative AI—particularly diffusion models capable of producing stunning photorealistic images from text—the world of automated sketch generation has been surprisingly quiet. Existing tools often fall short of what artists truly need: precise control over the final output. It’s one thing to generate “a cat sitting on a mat,” but it’s another to specify that the cat should be in the top-left corner, facing right, and be a certain size. ...

2025-04
[LoRA: Low-Rank Adaptation of Large Language Models 🔗](https://arxiv.org/abs/2106.09685)

LoRA: Fine-Tune Giant AI Models with 10,000× Fewer Parameters

The world of Natural Language Processing (NLP) has been transformed by massive, pre-trained language models like GPT-3. These colossal models, trained on vast portions of the internet, can perform a stunning array of tasks right out of the box. But to unlock their full potential for a specific application—be it a customer service chatbot, a legal document summarizer, or a code generator—we need to adapt them. This process is called fine-tuning. ...
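The mechanism the full post unpacks can be previewed in a few lines: LoRA freezes the pre-trained weight matrix W and learns a low-rank update ΔW = B·A (with B initialized to zero and a scaling factor α/r), so only the small A and B matrices are trained. Below is a minimal PyTorch sketch; the class and variable names are ours, and the module is illustrative rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # A: small random init
        self.B = nn.Parameter(torch.zeros(out_features, r))        # B: zeros, so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable values vs. 590,592 in the full linear layer
```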

2021-06
[Adam: A Method for Stochastic Optimization 🔗](https://arxiv.org/abs/1412.6980)

Why Adam Became Deep Learning's Go-To Optimizer

If you’ve ever trained a deep learning model, you’ve almost certainly encountered the Adam optimizer. Since its introduction in 2014, it has become one of the most popular—and often the default—optimization algorithms for training neural networks. But what exactly is Adam? How does it work, and why is it so effective? In this article, we’ll unpack the original paper that introduced Adam: Adam: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Lei Ba. We’ll break down the core concepts, walk through the algorithm step-by-step, and explore the experiments that demonstrated its power. Whether you’re a student just starting in machine learning or a practitioner looking for a deeper understanding, this guide will demystify one of the most fundamental tools in the modern deep learning toolkit. ...
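As a preview of the algorithm the article walks through, here is a minimal NumPy sketch of a single Adam step using the paper's update rule and default hyperparameters (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the toy quadratic objective at the end is our own illustration, not an experiment from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, then the parameter step."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta has moved from 5.0 toward the minimum at 0
```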

2014-12
[HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning 🔗](https://arxiv.org/abs/2509.08519)

HuMo: Generate Lifelike Human Videos from Text, Photos, and Voice

Imagine being able to direct a short film entirely from your laptop. You provide a photo of an actor, a script for their lines, and a description of the scene — and an AI model generates a high-quality video that brings your vision to life. This is the promise of Human-Centric Video Generation (HCVG), a rapidly evolving field that’s reshaping content creation. Traditionally, producing even a short video is a complex, expensive endeavor involving casting, location scouting, filming, and post-production. Generative AI aims to democratize this process by allowing creators to craft videos from simple multimodal inputs: text for describing scenes and actions, images for defining character identity, and audio for speech. ...

2025-09
[VLA-Adapter: An Efficient Paradigm for Tiny-Scale Vision-Language-Action Models 🔗](https://arxiv.org/abs/2509.09372)

Small Model, Big Impact: How VLA-Adapter Shrinks Robot Brains by 14×

Imagine a robot that can understand your instructions, see the world around it, and perform complex tasks like “Pick up the spoon, place it in the cup, then move the cup onto the plate.” This is the promise of Vision-Language-Action (VLA) models—the “brains” behind the next generation of general-purpose robots. Traditionally, building such models involves a brute-force approach: take a massive Vision-Language Model (VLM), pre-train it on colossal datasets of robot data, and fine-tune it for specific tasks. This works, but it has serious downsides: ...

2025-09
[Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing 🔗](https://arxiv.org/abs/2509.08721)

SAPO: How a Swarm of AI Models Learned 94% Faster by Sharing Experiences

Training large language models (LLMs) is a monumental task. But what happens after the initial pre-training? How do we refine these models—making them better at complex reasoning, following instructions, and avoiding harmful outputs? One of the most powerful techniques for this is Reinforcement Learning (RL), where a model learns through trial and error, much like a person mastering a new skill. However, applying RL to massive models comes with a hefty price tag. It often demands enormous, centralized GPU clusters where models are trained in lockstep. This approach is not only incredibly expensive but also poses significant technical challenges: communication bottlenecks, latency issues, and a reliance on highly specialized, homogeneous hardware. It’s a game largely reserved for a few big players with deep pockets. ...

2025-09