[A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery 🔗](https://arxiv.org/abs/2406.10833)

Beyond Chatbots—How LLMs are Re-Engineering the Scientific Method

In the last few years, the term “Large Language Model” (LLM) has become synonymous with chatbots that can write emails, debug code, or compose poetry. However, a quiet revolution is happening in a sector far more critical to human progress: the natural sciences. Biology, chemistry, physics, and mathematics are drowning in data. The rate of publication has far outpaced any human’s ability to read, let alone synthesize, new information. Furthermore, scientific data is distinct; it isn’t just English text. It involves molecular graphs, protein sequences, mathematical formulas, and complex imagery. ...

2024-06 · 11 min · 2163 words
[A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives 🔗](https://arxiv.org/abs/2407.15489)

Lost in Translation? Why Machine Translation Might Be the Secret Weapon of Multilingual AI

If you have been following the explosion of Natural Language Processing (NLP) over the last few years, you are likely familiar with the heavy hitters: BERT, GPT, and T5. These models have revolutionized how machines understand human language. Recently, the focus has shifted toward multilingual models—systems capable of understanding and generating text in dozens, sometimes hundreds, of languages simultaneously. ...

2024-07 · 9 min · 1739 words
[A Closer Look at Multidimensional Online Political Incivility 🔗](https://aclanthology.org/2024.emnlp-main.827.pdf)

Style vs. Substance: Decoding the Two Faces of Political Toxicity on Social Media

If you have spent any time on Twitter (now X) during an election season, you know that the discourse can get ugly. But “ugly” is a vague term. Is a tweet containing a swear word directed at a senator the same as a tweet calmly accusing a specific group of people of being “traitors to the country”? For years, content moderation tools and researchers have treated online toxicity as a binary problem: a post is either “safe” or “toxic.” However, a recent research paper titled “A Closer Look at Multidimensional Online Political Incivility” argues that this binary view is insufficient for understanding political communication. ...

8 min · 1492 words
[A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution 🔗](https://arxiv.org/abs/2410.21716)

The Stylistic Fingerprint: Solving Authorship Attribution with Bayesian LLMs

Imagine finding a lost manuscript claiming to be a forgotten work by Jane Austen or identifying the anonymous creator behind a coordinated misinformation campaign on social media. These scenarios rely on authorship attribution—the computational science of determining who wrote a specific text based on linguistic patterns. For decades, this field relied on manually counting words or, more recently, fine-tuning heavy neural networks. But a new paper, A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution, proposes a fascinating shift. Instead of training models to classify authors, the researchers leverage the raw, pre-trained probabilistic nature of Large Language Models (LLMs) like Llama-3. ...
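To make the shift concrete, here is a minimal sketch, assuming an off-the-shelf causal LLM from Hugging Face and a uniform prior over candidates: each author is scored by the log-probability the model assigns to the disputed text when conditioned on a sample of that author's known writing, and Bayes' rule turns those scores into a decision. The checkpoint name, the exemplar-conditioning setup, and the helper functions below are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch (illustrative, not the paper's exact recipe): score each candidate
# author by the log-probability an off-the-shelf causal LLM assigns to the disputed text
# when conditioned on a sample of that author's known writing, then apply Bayes' rule.
# The model name, exemplar-conditioning scheme, and uniform prior are all assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"  # any causal LM checkpoint works for this sketch
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def log_likelihood(context: str, disputed: str) -> float:
    """Sum of log P(token | context, previous tokens) over the disputed text only."""
    ctx = tok(context, return_tensors="pt").input_ids
    txt = tok(disputed, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([ctx, txt], dim=1)
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]        # position i predicts token i + 1
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = ids[0, 1:]
    start = ctx.shape[1] - 1                      # first position predicting disputed text
    rows = torch.arange(start, ids.shape[1] - 1)
    return log_probs[rows, targets[start:]].sum().item()

def attribute(disputed: str, exemplars: dict[str, str]) -> str:
    """MAP decision under a uniform prior: argmax over authors of log prior + log likelihood."""
    prior = math.log(1.0 / len(exemplars))
    scores = {a: prior + log_likelihood(ex, disputed) for a, ex in exemplars.items()}
    return max(scores, key=scores.get)
```

With a uniform prior, the decision reduces to picking the author whose known writing makes the disputed text most probable under the model.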

2024-10 · 8 min · 1682 words
[1 + 1 > 2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators? 🔗](https://arxiv.org/abs/2406.14721)

Breaking the Language Barrier: How Aggregating Cross-Lingual Knowledge Makes LLMs Smarter

Imagine asking a highly intelligent professor a question about the history of the Tang Dynasty. If you ask in English, they give you a vague, slightly inaccurate summary. But if you ask the exact same question in Chinese, they provide a rich, detailed, and factually perfect account. This is the current reality of Large Language Models (LLMs). Despite their reputation as universal knowledge bases, models like GPT-4 or Llama-3 suffer from a phenomenon known as multilingual inconsistency. Their “knowledge” is not stored in a language-agnostic database; it is entangled with the language of the training data. Because the internet contains vastly different information in English than it does in Chinese, Spanish, or Japanese, the model’s ability to answer questions fluctuates wildly depending on the language you use. ...

2024-06 · 8 min · 1697 words
[YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models 🔗](https://arxiv.org/abs/2409.13592)

Can AI Understand the Joke? Evaluating Satire Comprehension in Vision-Language Models with the YesBut Dataset

Artificial Intelligence has made massive strides in seeing and describing the world. Modern Vision-Language (VL) models can look at a photo of a kitchen and list the ingredients on the counter, or look at a street scene and describe the traffic. But can they understand humor? Specifically, can they grasp the biting irony of satire? ...

2024-09 · 9 min · 1876 words
[Wheeled Lab: Modern Sim2Real for Low-cost, Open-source Wheeled Robotics 🔗](https://arxiv.org/abs/2502.07380)

Democratizing Robot Learning: How Wheeled Lab Brings Modern Sim2Real to Low-Cost Robots

Imagine watching a $50,000 quadruped robot hike up a mountain trail or a specialized drone race through a complex circuit at champion-level speeds. These feats are awe-inspiring, representing the bleeding edge of robotics. They share a common secret sauce: Sim2Real—training a policy in a high-fidelity simulation and then deploying it into the real world. But here lies the problem: these innovations are often locked behind a paywall of expensive hardware and proprietary software. For the average undergraduate student, researcher on a budget, or robotics hobbyist, accessing the tools required to learn these modern techniques is nearly impossible. You are often left with outdated simulators and basic line-following robots, while the state-of-the-art races ahead. ...

2025-02 · 7 min · 1486 words
[FlashBack: Consistency Model-Accelerated Shared Autonomy 🔗](https://arxiv.org/abs/2505.16892)

The Fast Lane to Shared Autonomy: How Consistency Models Turn Robots into Real-Time Copilots

Imagine trying to land a spacecraft on the moon or insert a delicate plug into a socket using a robotic arm. These tasks require extreme precision. Now, imagine doing it with a joystick that has a slight delay or feels “mushy.” This is the challenge of teleoperation. Shared autonomy is the solution: a collaborative approach where a human pilot provides high-level intent (via a joystick or VR controller), and an AI “copilot” handles the low-level precision, stabilizing the motion and avoiding collisions. ...
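As a rough sketch of the shared-autonomy loop described above (not FlashBack's consistency-model copilot), the command sent to the robot can be a weighted blend of the pilot's raw joystick input and the assistant's proposed correction; the blending rule and the authority value alpha below are illustrative assumptions.

```python
# A generic shared-autonomy blend (illustrative; FlashBack's copilot is a learned
# consistency-model policy, not this fixed rule): mix the human's command with the
# assistant's suggestion, where alpha = 1.0 is full autonomy and 0.0 is full manual.
import numpy as np

def blend_commands(human: np.ndarray, assist: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Convex combination of human intent and copilot output."""
    return alpha * assist + (1.0 - alpha) * human

# Example: a slightly "mushy" joystick input nudged toward the copilot's precise target.
human_cmd = np.array([0.10, -0.40])   # pilot's raw joystick axes
assist_cmd = np.array([0.05, -0.55])  # copilot's proposed correction
print(blend_commands(human_cmd, assist_cmd))
```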

2025-05 · 7 min · 1424 words
[CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception 🔗](https://arxiv.org/abs/2505.21495)

Giving Robots the Sense of Touch: How CLAMP Crowdsourced Haptic Perception

Imagine you are reaching into a cluttered bag to find your house keys. You can’t really see them, but your fingers brush against cold metal, and you instantly know you’ve found them. Or consider checking if a banana is ripe; looking at the color helps, but a gentle squeeze tells you if it’s mushy or firm. For humans, integrating vision with haptics (the sense of touch) is seamless. For robots, it is an immense challenge. While computer vision has seen an explosion of capabilities due to massive datasets like ImageNet, robotic touch has lagged behind. Robots struggle to differentiate between a plastic apple and a real one, or a metal cup and a ceramic one, simply by looking. They need to feel. ...

2025-05 · 8 min · 1685 words
[DEQ-MPC: Deep Equilibrium Model Predictive Control 🔗](https://openreview.net/pdf?id=zQXurgHUVX)

Closing the Loop: Merging Neural Networks and Control Solvers with Deep Equilibrium Models

In the world of robotics, there is a constant tension between flexibility and safety. On one hand, we want robots to use Neural Networks (NNs) to learn complex behaviors, adapt to new environments, and process high-dimensional sensor data. On the other hand, neural networks are often “black boxes”—we can’t easily guarantee they won’t command a drone to fly into a wall. To solve this, roboticists rely on Model Predictive Control (MPC). MPC is a mathematical framework that plans movements by solving an optimization problem at every moment, strictly adhering to safety constraints (like “do not hit obstacle” or “stay within motor limits”). ...
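For readers meeting MPC for the first time, the sketch below shows the receding-horizon optimization in its simplest form: a toy 1-D double integrator with an actuator limit and an obstacle constraint, re-solved at every step. It is generic MPC, not DEQ-MPC, and the dynamics, horizon, and cost weights are assumptions chosen for illustration.

```python
# A minimal receding-horizon MPC sketch (generic MPC, not DEQ-MPC): at every step, solve
# a small constrained optimization over a short horizon for a 1-D double integrator, then
# apply only the first control and re-plan. Dynamics, horizon, and weights are illustrative.
# Requires: pip install numpy cvxpy
import cvxpy as cp
import numpy as np

H, dt = 10, 0.1                        # planning horizon and timestep
A = np.array([[1.0, dt], [0.0, 1.0]])  # state: [position, velocity]
B = np.array([[0.0], [dt]])
u_max, wall = 1.0, 0.8                 # "stay within motor limits", "do not hit obstacle"

def mpc_step(x0: np.ndarray, x_goal: np.ndarray) -> float:
    x = cp.Variable((2, H + 1))
    u = cp.Variable((1, H))
    cost, constraints = 0, [x[:, 0] == x0]
    for k in range(H):
        cost += cp.sum_squares(x[:, k + 1] - x_goal) + 0.1 * cp.sum_squares(u[:, k])
        constraints += [
            x[:, k + 1] == A @ x[:, k] + B @ u[:, k],  # dynamics model
            cp.abs(u[:, k]) <= u_max,                  # actuator limit
            x[0, k + 1] <= wall,                       # keep position short of the obstacle
        ]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return float(u.value[0, 0])                        # execute only the first control

print(mpc_step(np.array([0.0, 0.0]), np.array([0.5, 0.0])))
```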

8 min · 1684 words
[Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation 🔗](https://arxiv.org/abs/2505.04619)

MAD Skills: How to Teach Robots to See with One Eye or Many

Visual reinforcement learning (RL) has pushed the boundaries of what robots can do, from beating Atari games to performing complex dexterous manipulation. However, a significant gap remains between a robot that performs well in a controlled simulation and one that is robust enough for the real world. A major part of this challenge lies in vision. In the real world, depth perception is crucial. Humans achieve this naturally through binocular vision—using two eyes to triangulate 3D structure. Similarly, robots benefit immensely from multiple camera views. Merging these views creates a richer representation of the world, overcoming occlusions and improving learning speed (sample efficiency). ...

2025-05 · 9 min · 1779 words
[GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data 🔗](https://arxiv.org/abs/2505.03233)

Can Robots Learn to Grasp Using Only Synthetic Data? A Deep Dive into GraspVLA

The Data Bottleneck in Robotics: We are currently witnessing a golden age of Artificial Intelligence, driven largely by Foundation Models. In Natural Language Processing (NLP) and Computer Vision (CV), models like GPT-4 and Gemini have achieved staggering capabilities. Their secret weapon? The internet. These models are pre-trained on trillions of tokens of text and billions of images scraped from the web. However, there is one frontier where this formula hasn’t quite worked yet: Robotics. ...

2025-05 · 8 min · 1580 words
[Multi-Loco: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion 🔗](https://arxiv.org/abs/2506.11470)

One Brain, Many Bodies: How Multi-Loco Unifies Robot Control with Diffusion and RL

Imagine if learning to ride a bicycle immediately made you better at walking on stilts or ice skating. In the biological world, this kind of skill transfer happens constantly; animals adapt their motor control strategies to different terrains and physical changes. In robotics, however, this has remained a distant dream. Typically, if you want to train a quadruped (a four-legged robot dog) and a humanoid (a two-legged robot), you need two completely separate training pipelines. Their bodies are different, their motors are different, and the physics governing their movement are distinct. ...

2025-06 · 8 min · 1700 words
[Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration 🔗](https://arxiv.org/abs/2412.10349)

Why Robots Break Doors (And How "SafeDiff" Fixes It with Tactile Diffusion)

Opening a door seems like the simplest task in the world. For a human, it’s effortless: you reach out, grab the handle, and pull. If the door is heavy or the hinge is stiff, your hand automatically adjusts the force and trajectory to follow the door’s natural arc. You don’t even think about it. For a robot, however, this simple act is a geometric nightmare. If a robot’s planned trajectory deviates even slightly from the door’s physical constraints—say, by pulling a centimeter too far to the left—it fights against the hinge. This creates “harmful forces.” In the best-case scenario, the robot fails the task. In the worst case, it rips the handle off the door or burns out its own motors. ...

2024-12 · 8 min · 1591 words
[Learning Long-Horizon Robot Manipulation Skills via Privileged Action 🔗](https://arxiv.org/abs/2502.15442)

Cheating to Win: How Privileged Actions Teach Robots Complex Skills

Reinforcement Learning (RL) has achieved remarkable things, from beating grandmasters at Go to teaching robots how to run. But if you ask a robot to perform a seemingly simple task—like picking up a credit card lying flat on a table—it often flails. This specific type of problem is known as a “long-horizon, contact-rich” task. To succeed, the robot cannot just close its gripper; it must push the card to the edge of the table, reorient its hand, and then grasp it. This requires a sequence of precise interactions (pushing, sliding, pivoting) where the reward (holding the object) only comes at the very end. Standard RL struggles here because the search space is massive, and randomly stumbling upon this complex sequence is statistically impossible. ...

2025-02 · 7 min · 1399 words
[Estimating Value of Assistance for Online POMDP Robotic Agents 🔗](https://openreview.net/pdf?id=xzR8rBRgPp)

How Robots Decide to Help: Calculating Value of Assistance in Uncertain Worlds

Imagine a collaborative robotic scenario in a warehouse. One robot, equipped with a robotic arm, is trying to pick up a specific can of soda from a cluttered shelf. However, its sensors are noisy, and a large box is blocking its line of sight. It knows the can is there, but it can’t pinpoint the location with enough certainty to act safely. Nearby, a second robot with a vacuum gripper is idle. This second robot could move the box, revealing the can and making the first robot’s job significantly easier. ...

10 min · 1935 words
[COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping 🔗](https://arxiv.org/abs/2502.08054)

Two Hands Are Better Than One: Mastering Occluded Grasping with COMBO-Grasp

Imagine a flat, thin computer keyboard lying on a desk. You want to pick it up. If you just try to grab it from the top, your fingers will likely hit the desk before you can get a solid grip. The desk “occludes” (blocks) the grasp. So, what do you do naturally? You probably use your non-dominant hand to tilt the keyboard up or brace it, while your dominant hand secures the grip. ...

2025-02 · 8 min · 1703 words
[SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation 🔗](https://openreview.net/pdf?id=xVDj9uq6K3)

Can Vision-Language Models Navigate a Crowd? Inside the SocialNav-SUB Benchmark

Imagine you are walking down a busy university hallway between classes. You see a group of students chatting on your left, a professor hurrying towards you on your right, and a janitor mopping the floor ahead. Without consciously thinking, you adjust your path. You weave slightly right to give the group space, you slow down to let the professor pass, and you avoid the wet floor. This dance of “social navigation” is second nature to humans. We effortlessly interpret intentions, social norms, and spatial dynamics. ...

10 min · 2029 words
[AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit 🔗](https://arxiv.org/abs/2502.09762)

Can Robots Collaborate with Strangers? Inside the AT-Drone Benchmark

Imagine a high-stakes search-and-rescue mission in a dense forest. A fleet of drones is scanning the ground below. Suddenly, one drone suffers a battery failure and must return to base. A backup drone is immediately deployed to take its place. In an ideal world, this swap is seamless. The new drone joins the formation, understands the current strategy, and collaborates perfectly with the existing team. But in reality, this is an immense challenge for robotics. Most multi-agent systems are trained to work with specific, pre-defined partners. They rely on “over-training” with a fixed team, developing a secret language of movements and reactions. When a stranger—an “unseen” teammate—enters the mix, the coordination often falls apart. ...

2025-02 · 10 min · 2025 words
[Robot Operating Home Appliances by Reading User Manuals 🔗](https://openreview.net/pdf?id=wZUQq0JaL6)

Can Robots Learn to Cook by Reading the Manual? Meet ApBot

Imagine unboxing a new, high-tech rice cooker. It has a dozen buttons, a digital display, and a distinct lack of intuitive design. You, a human, likely grab the user manual, flip to the “Cooking Brown Rice” section, and figure out the sequence of buttons to press. Now, imagine you want your home assistant robot to do this. For a robot, this is a nightmare scenario. Unlike a hammer or a cup, an appliance is a “state machine”—it has hidden internal modes, logic constraints (you can’t start cooking if the lid is open), and complex inputs. Historically, roboticists have had to hard-code these interactions for specific devices. But what if a robot could simply read the manual and figure it out, just like we do? ...
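To make the "state machine" framing concrete, here is a toy model with hidden modes, guarded transitions, and the lid constraint mentioned above; the buttons and modes are hypothetical and are not taken from ApBot or any real manual.

```python
# A toy appliance-as-state-machine (hypothetical buttons and modes, not ApBot's model):
# the cooker has a hidden mode, transitions triggered by button presses, and a logic
# constraint -- you cannot start cooking while the lid is open.
from dataclasses import dataclass

@dataclass
class RiceCooker:
    lid_open: bool = True
    mode: str = "idle"

    # (current mode, button pressed) -> next mode
    TRANSITIONS = {
        ("idle", "menu"): "brown_rice_selected",
        ("brown_rice_selected", "start"): "cooking",
        ("cooking", "cancel"): "idle",
    }

    def press(self, button: str) -> str:
        if button == "start" and self.lid_open:
            return f"blocked: cannot start cooking with the lid open (mode stays '{self.mode}')"
        self.mode = self.TRANSITIONS.get((self.mode, button), self.mode)
        return f"mode is now '{self.mode}'"

cooker = RiceCooker()
cooker.lid_open = False
print(cooker.press("menu"))   # -> brown_rice_selected
print(cooker.press("start"))  # -> cooking
```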

7 min · 1395 words