[Can Community Notes Replace Professional Fact-Checkers? 🔗](https://arxiv.org/abs/2502.14132)

The Hidden Backbone - Why Community Notes Need Professional Fact-Checkers

In the ever-evolving landscape of social media, the battle against misinformation has taken a fascinating turn. For years, platforms like Facebook and Twitter (now X) relied on partnerships with professional fact-checking organizations—groups like Snopes, PolitiFact, and Reuters—to flag false claims. Recently, however, the tide has shifted toward “community moderation.” The logic seems democratic and scalable: instead of paying a small group of experts to check a massive volume of content, why not empower the users themselves to police the platform? This is the philosophy behind Community Notes on X (formerly Twitter). The idea is that the “wisdom of the crowd” can identify falsehoods faster and more effectively than a newsroom. ...

2025-02 · 8 min · 1680 words
[Call for Rigor in Reporting Quality of Instruction Tuning Data 🔗](https://arxiv.org/abs/2503.04807)

The Hyperparameter Lottery: Why We Might Be Misjudging LLM Data Quality

Introduction In the current landscape of Large Language Model (LLM) development, data is the new gold. But not just any data—we are specifically obsessed with Instruction Tuning (IT) data. This is the dataset that turns a raw, text-predicting base model into a helpful chatbot that can answer questions, summarize emails, and write code. A prevailing trend in recent research is “Less is More.” Studies like LIMA (Less Is More for Alignment) have suggested that you don’t need millions of instructions to train a great model; you might only need 1,000 highly curated, high-quality examples. This has triggered a gold rush to find the perfect “high-quality” dataset. Every week, a new paper claims that “Dataset A is better than Dataset B” or that a specific filtering method selects the best data. ...

2025-03 · 10 min · 1954 words
[CHEER-Ekman: Fine-grained Embodied Emotion Classification 🔗](https://arxiv.org/abs/2506.01047)

Decoding the Body Language of Text: How LLMs Learned to Feel

Introduction When you read the phrase “Her heart was racing,” what do you understand? Depending on the context, she could be terrified of a spider, or she could be looking at the love of her life. This is the challenge of Embodied Emotion. Emotions aren’t just abstract concepts in our brains; they are physical experiences. We clench our fists in anger, our stomachs churn in disgust, and our eyes widen in surprise. In Natural Language Processing (NLP), detecting explicit emotions (e.g., “I am happy”) is largely a solved problem. However, detecting the subtle, physical manifestations of emotion—and correctly classifying them—remains a significant hurdle. ...

2025-06 · 8 min · 1497 words
[BQA: Body Language Question Answering Dataset for Video Large Language Models 🔗](https://arxiv.org/abs/2410.13206)

Can AI Read the Room? Decoding Body Language with the New BQA Dataset

Introduction We’ve all been there: a friend says “I’m fine,” but their crossed arms, avoided eye contact, and stiff posture scream the exact opposite. As humans, a massive chunk of our communication relies on these nonverbal cues. We interpret intent, emotion, and social dynamics not just by listening to words, but by “reading the room.” For Artificial Intelligence, specifically Video Large Language Models (VideoLLMs), this is a frontier that remains largely unconquered. While models like GPT-4o or Gemini are getting remarkably good at describing what objects are in a video, understanding the emotional subtext of human movement is a different beast entirely. Human body language lacks formal rules; it is fluid, culturally dependent, and often unconscious. ...

2024-10 · 9 min · 1891 words
[Automatic detection of dyslexia based on eye movements during reading in Russian 🔗](https://aclanthology.org/2025.acl-short.5.pdf)

Eyes on the Page—Using LSTMs to Detect Dyslexia Through Gaze Patterns

Dyslexia is one of the most common learning disabilities, affecting an estimated 9% to 12% of the population. It is not a visual problem, nor is it related to intelligence; rather, it is a difficulty with phonological decoding—mapping sounds to letters. While the condition is lifelong, early diagnosis is the single most critical factor in ensuring a child stays on track in the educational system. The problem, however, is logistics. Standard testing batteries for dyslexia are expensive, time-consuming, and require one-on-one administration by trained specialists who are not always available in schools. This creates a bottleneck where many children slip through the cracks. ...

7 min · 1447 words
[Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching 🔗](https://arxiv.org/abs/2505.24643)

Why Big O Notation Lies to You: Rethinking Sorting for LLM Re-Ranking

Introduction If you have ever taken a computer science algorithms course, you know the drill. When asked “What is the most efficient sorting algorithm?”, the answer is almost reflexively “Merge Sort,” “Heap Sort,” or “Quick Sort.” Why? Because of Big O notation. We are taught that \(O(n \log n)\) is the gold standard for comparison-based sorting, while algorithms like Bubble Sort (\(O(n^2)\)) are relegated to the “never use in production” bin. ...

2025-05 · 9 min · 1773 words
[An Effective Curriculum Learning for Sequence Labeling Incorporating Heterogeneous Knowledge 🔗](https://arxiv.org/abs/2402.13534)

Learning Like a Human: Accelerating Sequence Labeling with Dual-Stage Curriculum Learning

Introduction Imagine teaching a child to read. You wouldn’t start by handing them a complex legal contract or a page from Shakespeare. Instead, you begin with simple sentences: “The cat sat on the mat.” Once they master the basics, you gradually introduce more complex grammar, vocabulary, and sentence structures. This intuitive progression—learning the easy stuff before the hard stuff—is the foundation of Curriculum Learning (CL) in artificial intelligence. In the field of Natural Language Processing (NLP), however, we often ignore this intuition. We tend to train models by feeding them data in random batches, mixing simple phrases with incredibly complex, ambiguous sentences. ...

2024-02 · 8 min · 1582 words
[Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings 🔗](https://aclanthology.org/2025.acl-short.51.pdf)

Decoding the Jungle: How Bird and Human AI Models Team Up to Identify Monkeys

Imagine standing in the dense tropical forests of Costa Rica. The air is thick with humidity, and the soundscape is a chaotic symphony of insect buzzes, bird calls, wind rustling through leaves, and distant rumbles. In the middle of this acoustic storm (“the cocktail party problem”), a white-faced capuchin monkey calls out. For a human researcher, identifying which specific monkey just made that sound is an arduous task requiring years of training and intense focus. For a computer, it’s even harder. The lack of large, labeled datasets for wild animals has long been a bottleneck in bioacoustics. We have massive datasets for human speech and decent ones for bird calls, but for a specific species of Neotropical primate? The data is scarce. ...

8 min · 1525 words
[Accelerating Dense LLMs via L0-regularized Mixture-of-Experts 🔗](https://aclanthology.org/2025.acl-short.39.pdf)

How to Turn Heavy Dense LLMs into Fast Sparse Experts - A Deep Dive into L0-MoE

Introduction: The Efficiency Bottleneck We are currently living in the era of the “Scaling Law.” The logic that has driven AI progress for the last few years is simple: bigger models equal better performance. Whether it’s Llama-3, Qwen2, or Mistral, increasing the parameter count consistently unlocks new capabilities in reasoning, coding, and general knowledge. However, this intelligence comes with a steep price tag: Inference Latency. Running a massive 70B or even 8B parameter model is computationally expensive. Every time you ask a chatbot a question, a dense model has to activate all of its parameters to generate a response. This leads to slow generation speeds and high operational costs. ...

9 min · 1708 words
[A Variational Approach for Mitigating Entity Bias in Relation Extraction 🔗](https://arxiv.org/abs/2506.11381)

Breaking the Habit: How Variational Information Bottleneck Reduces Entity Bias in Relation Extraction

Introduction Imagine you are reading a financial news headline: “Microsoft invests $10 billion in…” Before you even finish the sentence, your brain probably fills in the blank with “OpenAI.” You didn’t need to read the rest of the text because you relied on your prior knowledge of the entities involved. While this heuristic is useful for humans, it is a significant problem for Artificial Intelligence. In the field of Natural Language Processing (NLP), this phenomenon is known as Entity Bias. Models like BERT or RoBERTa often memorize connections between specific entities (e.g., “Microsoft” and “invest”) rather than understanding the context of the sentence. If the sentence actually read “Microsoft sues OpenAI,” a biased model might still predict an “investment” relationship simply because it over-relies on the names. ...

2025-06 · 7 min · 1394 words
[A Measure of the System Dependence of Automated Metrics 🔗](https://arxiv.org/abs/2412.03152)

Is Your AI Metric Fair? Why We Need to Measure System Dependence in Machine Translation

Imagine you are a carpenter building tables. You have a ruler to measure the length of your work. But this ruler has a strange property: when you measure a table made of oak, an inch is exactly 2.54 centimeters. But when you measure a table made of pine, the ruler magically shrinks, and an “inch” becomes only 2 centimeters. As a result, your pine tables receive inflated measurements, while your oak tables are penalized. ...

2024-12 · 7 min · 1486 words
[A Little Human Data Goes A Long Way 🔗](https://arxiv.org/abs/2410.13098)

The 2.5% Rule: Why Synthetic Data Still Needs a Human Touch

Introduction In the current landscape of Artificial Intelligence, data is the new oil, but the wells are running dry. From the early days of BERT to the massive scale of GPT-4, the growth of Language Models (LMs) has been fueled by an exponential increase in training data. However, we are approaching a critical bottleneck: high-quality, human-annotated data is expensive, slow to produce, and difficult to scale for specialized tasks. Faced with this “data hunger,” researchers and engineers have turned to a promising alternative: Synthetic Data. If modern Large Language Models (LLMs) are so smart, why not ask them to generate the training data for the next generation of models? It sounds like the perfect perpetual motion machine—AI teaching AI, eliminating the need for costly human labor. ...

2024-10 · 10 min · 1929 words
[v-CLR: View-Consistent Learning for Open-World Instance Segmentation 🔗](https://arxiv.org/abs/2504.01383)

Breaking the Texture Bias: How v-CLR Masters Open-World Segmentation

Imagine you show a child a red apple. They learn what an “apple” is. Later, you show them a green apple, or perhaps a plastic toy apple painted blue. The child immediately recognizes it as an apple because they understand its shape and structure, not just its color or texture. Now, try the same experiment with a standard computer vision model. If trained only on red apples, many models will fail spectacularly when presented with a blue one. Why? Because deep neural networks are notoriously lazy: they often cheat by memorizing textures (like the specific shiny red skin) rather than learning the underlying geometry of the object. ...

2025-04 · 9 min · 1861 words
[Encoder-only Mask Transformer: Your ViT is Secretly an Image Segmentation Model 🔗](https://arxiv.org/abs/2503.19108)

Less is More: Why Your Plain Vision Transformer is Already a Segmentation Expert

Introduction In the rapidly evolving world of Computer Vision, there is a prevailing tendency to solve complex problems by adding complexity to architectures. When the Vision Transformer (ViT) burst onto the scene, it revolutionized image classification. However, when researchers tried to apply it to more granular tasks like image segmentation—where the goal is to classify every pixel—the consensus was that the “plain” ViT wasn’t enough. To bridge the gap, the field developed a standard recipe: take a ViT, attach a heavy “adapter” to extract multi-scale features (mimicking Convolutional Neural Networks), add a “pixel decoder” to fuse these features, and finally, cap it off with a complex “Transformer decoder” to generate masks. State-of-the-art models like Mask2Former follow this pattern, becoming powerful but architecturally convoluted and computationally heavy. ...

2025-03 · 9 min · 1889 words
[Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding 🔗](https://arxiv.org/abs/2503.06287)

Unlocking the Hidden Sight of LVLMs: How Frozen Models Can Localize Objects Without Training

Introduction In the rapidly evolving landscape of Artificial Intelligence, Large Vision-Language Models (LVLMs) like LLaVA, GPT-4V, and DeepSeek-VL have become the superstars of multimodal understanding. These models possess an uncanny ability to describe complex scenes, answer questions about images, and even engage in reasoning tasks that were previously thought impossible. However, there has been a persistent gap in their capabilities. While an LVLM can eloquently describe a “red car parked next to a fire hydrant,” asking it to pinpoint the exact pixel coordinates of that car often results in failure or requires significant modifications to the model. This task—identifying the specific region in an image that corresponds to a text description—is known as Visual Grounding. ...

2025-03 · 9 min · 1838 words
[We see it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale 🔗](https://arxiv.org/abs/2412.06699)

See3D: How AI Can Learn 3D Geometry Just by Watching Videos

Introduction How do you understand the three-dimensional structure of the world? You don’t walk around with a ruler and a protractor, measuring the precise coordinates of every object you see. You don’t rely on “gold-standard” 3D mesh data implanted in your brain. Instead, you observe. You move your head, walk around a statue, or drive down a street. Your brain stitches these continuous 2D observations into a coherent 3D model. ...

2024-12 · 9 min · 1836 words
[XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? 🔗](https://arxiv.org/abs/2503.23771)

Can AI Really See the World? Why Remote Sensing is the Final Frontier for Multimodal LLMs

We live in an era where Multimodal Large Language Models (MLLMs) like GPT-4o and Gemini can describe a photo of a cat on a couch with poetic detail. They can explain memes, read charts, and even help fix your coding errors based on a screenshot. But what happens when you ask these same models to look at the world from 20,000 feet up? Imagine asking an AI to analyze a satellite image of a sprawling port city. You don’t just want to know “is this a city?” You want to know: How many ships are docked? Is that traffic jam near the bridge caused by an accident? What is the safest route for a truck to get from the warehouse to the pier? ...

2025-03 · 7 min · 1316 words
[X-Dyna: Expressive Dynamic Human Image Animation 🔗](https://arxiv.org/abs/2501.10021)

Breathing Life into Pixels: Deep Dive into X-Dyna's Dynamic Human Animation

The dream of “Harry Potter”-style moving photographs has been a driving force in computer vision for decades. We want to take a single static photo of a person and animate it using a driving video—making the subject dance, speak, or walk while preserving their identity. While recent advances in diffusion models have made this possible, there is a lingering “uncanny valley” effect in current state-of-the-art methods. You might see a person dancing perfectly, but their hair behaves like a solid helmet, their dress moves like rigid cardboard, and the background remains frozen in time. The person moves, but the dynamics—the physics of wind, gravity, and momentum—are missing. ...

2025-01 · 10 min · 1932 words
[World-consistent Video Diffusion with Explicit 3D Modeling 🔗](https://arxiv.org/abs/2412.01821)

Beyond RGB: How WVD Brings Explicit 3D Consistency to Video Diffusion

The recent explosion in generative AI has given us models capable of dreaming up distinctive images and surreal videos from simple text prompts. We have seen tremendous progress with diffusion models, which have evolved from generating static portraits to synthesizing dynamic short films. However, if you look closely at AI-generated videos, you will often notice a subtle, nagging problem: the world doesn’t always stay “solid.” Objects might slightly warp as the camera moves; the geometry of a room might shift impossibly; or the background might hallucinate new details that contradict previous frames. This happens because most video diffusion models are learning pixel consistency over time, but they don’t inherently understand the 3D structure of the world they are rendering. They are excellent 2D artists, but poor 3D architects. ...

2024-12 · 10 min · 1960 words
[WonderWorld: Interactive 3D Scene Generation from a Single Image 🔗](https://arxiv.org/abs/2406.09394)

Building Infinite 3D Worlds in Seconds: A Deep Dive into WonderWorld

Imagine you are playing a video game or designing a virtual environment. You snap a photo of a street corner, and you want that photo to instantly expand into a fully navigable, infinite 3D world. You want to walk down that street, turn the corner, and see new buildings, parks, and skies generated in real-time, exactly as you imagine them. For years, this has been the “holy grail” of computer vision and graphics. While we have seen massive leaps in generative AI (like Midjourney for 2D images) and 3D reconstruction (like NeRFs and Gaussian Splatting), combining them into a fast, interactive experience has remained elusive. Current methods are typically “offline”—meaning you provide an image, wait 30 minutes to an hour for a server to process it, and get back a static 3D scene. ...

2024-06 · 9 min · 1780 words