Imagine you are a robot butler. Your human asks you to “get a bottle of water from the fridge.” You have a map of the house, and you know how to open a door. You successfully navigate to the kitchen and park in front of the fridge. But there is a problem: you parked six inches too far to the left. Your robotic arm, despite being highly sophisticated, cannot reach the handle at the correct angle to pull the door open. You are stuck. To fix this, you have to move your entire base, but standard navigation systems don’t understand how to position the base to make the arm’s job easier.
This “last-mile” problem is the core challenge of Mobile Manipulation.
In a fascinating new paper, researchers introduce MoTo (Move and Touch), a framework that solves this coordination problem. MoTo acts as a “plug-in” module that connects navigation and manipulation without requiring expensive new training data. It allows robots to perform complex tasks in a zero-shot manner by intelligently coordinating the mobile base with the robotic arm.
In this deep dive, we will explore how MoTo works, the clever mathematics behind its trajectory optimization, and how it leverages Vision-Language Models (VLMs) to “see” where it needs to go.
The Disconnect: Why Mobile Manipulation is Hard
To understand why MoTo is necessary, we first need to look at the current landscape of robotics. Generally, robotic skills are divided into two silos:
- Navigation: Getting from Point A to Point B (e.g., SLAM, path planning).
- Manipulation: Interacting with objects using an arm (e.g., grasping, pouring).
Recent “Foundation Models” in robotics (like OpenVLA or RDT-1B) are incredible at manipulation. They can generalize to new objects and tasks, but they usually assume a fixed base. They can pick up a cup if it’s in front of them, but they can’t move the robot across the room to find the cup.
On the other hand, traditional mobile manipulation approaches often treat navigation and manipulation as separate, sequential steps. The robot navigates to a general “docking point” near a target and then attempts to manipulate it. As our robot butler example illustrates, if the navigation policy doesn’t account for the physical constraints of the arm (that is, if it lacks interaction-awareness), the robot fails.
Current “End-to-End” solutions try to learn everything at once, mapping camera pixels directly to wheel and arm motor commands. However, collecting the massive datasets required to train these models is prohibitively expensive and time-consuming.
Enter MoTo: The “Plug-and-Play” Solution
MoTo stands for Move and Touch. The researchers propose a modular approach that is “interaction-aware.” Instead of just navigating to “the fridge,” MoTo navigates to a specific position where the arm is mathematically guaranteed to have the best chance of successfully manipulating the fridge handle.

As shown in Figure 1, MoTo is designed to be a plug-in. It takes an off-the-shelf fixed-base manipulation model (like AnyGrasp or OpenVLA) and empowers it with mobility. Crucially, it does this Zero-Shot, meaning it doesn’t need to be trained on thousands of hours of mobile manipulation demonstration data.
The MoTo Pipeline: How It Works
The MoTo framework operates through a sophisticated pipeline that translates a high-level command (like “I am hungry”) into precise motor movements.

The process, illustrated in Figure 2, can be broken down into four distinct stages:
- Scene Understanding & Task Planning: The robot scans the room to build a 3D understanding of the world and uses an LLM to break the user’s command into steps.
- Keypoint Generation: Using Vision-Language Models (VLMs) to identify exactly where to touch the object and which part of the robot should do the touching.
- Interaction-Aware Navigation: Calculating the optimal “docking point” for the base.
- Trajectory Optimization: Smoothing the movement to ensure the robot doesn’t crash or move jerkily.
Let’s break these down in detail.
1. Scene Understanding and Task Planning
Before the robot moves, it needs to understand its environment. The robot scans the area to create a 3D point cloud and a Scene Graph (\(\mathcal{G}\)). A scene graph acts as a structured database of the room: it knows there is a “Table,” and on the “Table” is a “Plate.”
When a user gives a command \(\mathcal{T}\) (e.g., “I am hungry”), a Large Language Model (LLM) parses this instruction using the scene graph to generate a sequence of subtasks.

For example, “I am hungry” might become:
- Move to and pick up Banana (\(o_1\)).
- Move to Plate (\(o_2\)).
- Place Banana on Plate.
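To make this concrete, here is a toy sketch (my own illustration, not the paper’s code) of how a scene graph could be represented and serialized into an LLM prompt that asks for exactly this kind of subtask list. The class names and prompt wording are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative scene-graph node: objects are nodes, and "children" encodes a
# simple "supports" relation (e.g., the Table supports the Plate and Banana).
@dataclass
class SceneNode:
    name: str
    position: tuple                                   # 3D centroid in the world frame
    children: list = field(default_factory=list)      # objects resting on this one

def scene_graph_to_text(node: SceneNode, indent: int = 0) -> str:
    """Serialize the graph into indented text suitable for an LLM prompt."""
    lines = [" " * indent + f"- {node.name} at {node.position}"]
    for child in node.children:
        lines.append(scene_graph_to_text(child, indent + 2))
    return "\n".join(lines)

# Toy scene: a plate and a banana on a table.
table = SceneNode("Table", (1.0, 0.5, 0.0),
                  children=[SceneNode("Plate", (1.0, 0.5, 0.75)),
                            SceneNode("Banana", (1.2, 0.4, 0.75))])

prompt = (
    "You control a mobile manipulator. Scene:\n"
    + scene_graph_to_text(table)
    + "\nUser request: 'I am hungry'.\n"
    + "Return an ordered list of (skill, object) subtasks."
)
print(prompt)  # The actual LLM call (any chat-completion API) is omitted here.
```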
2. VLM-Based Keypoint Generation
This is one of the most innovative parts of the paper. Once the robot knows it needs to interact with the “Banana,” it needs to know where to grab it. This is the Target Keypoint (TK). Simultaneously, it needs to know which part of its own body (the gripper? a tool it’s holding?) will make contact. This is the Arm Keypoint (AK).
Since the researchers want to avoid training new models, they use a pre-trained Vision-Language Model (VLM).
Finding the Target Keypoint (TK)
The robot takes multiple images of the target object from different angles. It uses a combination of DINOv2 (for semantic feature extraction) and SAM (Segment Anything Model) to propose potential interaction points on the object.

As seen in Figure 6, the system generates many red dots (proposals). It then asks the VLM to select the best one based on the task description. For example, if the task is “Open the laptop,” the VLM should select a point on the lid, not the keyboard.
To ensure accuracy, MoTo uses Multi-view Voting. Since a single 2D image can be misleading regarding depth, the system projects these keypoints into 3D space from multiple camera views. It then “votes” to find the cluster of points that represents the true interaction spot in 3D space.

The voting mechanism ensures that the selected point is robust and geometrically consistent across camera views.
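Here is a minimal Python sketch of that projection-and-voting idea. The back-projection uses standard pinhole geometry; the density-based vote is my assumption about how the clustering could be implemented, not necessarily the paper’s exact rule.

```python
import numpy as np

def backproject(uv, depth, K, T_wc):
    """Lift a pixel (u, v) with known depth into world coordinates.
    K: 3x3 camera intrinsics, T_wc: 4x4 camera-to-world transform."""
    u, v = uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return (T_wc @ np.append(ray * depth, 1.0))[:3]

def vote_keypoint(candidates_3d, radius=0.03):
    """Return the candidate with the most neighbors within `radius` meters
    (a simple density vote over the per-view 3D projections)."""
    pts = np.asarray(candidates_3d)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    votes = (dists < radius).sum(axis=1)
    return pts[votes.argmax()]

# Example: keypoints back-projected from three views; two agree, one is an
# outlier from a bad depth reading, so the vote discards it.
candidates = [np.array([1.00, 0.50, 0.75]),
              np.array([1.01, 0.51, 0.74]),
              np.array([1.40, 0.20, 0.90])]
print(vote_keypoint(candidates))  # -> approximately (1.0, 0.5, 0.75)
```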

Finding the Arm Keypoint (AK)
Similarly, the system identifies the “Arm Keypoint.” Usually, this is the gripper. However, if the robot is holding a tool (like a broom), the “interaction point” is the tip of the broom, not the gripper itself. The VLM analyzes the wrist camera feed to determine this point dynamically.
3. Keypoint-Guided Trajectory Optimization
Now the robot has a Target Keypoint (TK) on the object and an Arm Keypoint (AK) on itself. The goal of the navigation policy is simple to state but hard to solve: Move the robot base so that the AK touches the TK.
The researchers formulate this as a mathematical optimization problem. They want to find a sequence of base movements (\(\theta^{base}\)) and arm movements (\(\theta^{arm}\)) that minimize the distance between these two points over time (\(T\)).
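The paper’s exact equation isn’t reproduced here, but a reconstruction consistent with this description looks roughly like the following, where \(\mathrm{AK}(\cdot)\) denotes the world-frame position of the Arm Keypoint under a given base and arm configuration (the notation and the summed form are my assumptions):

$$
\min_{\theta^{base}_{1:T},\,\theta^{arm}_{1:T}} \;\sum_{t=1}^{T}\Big(\big\lVert \mathrm{AK}(\theta^{base}_t,\theta^{arm}_t) - \mathrm{TK}\big\rVert \;+\; \mathcal{C}_t\Big)
$$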

However, simply minimizing distance isn’t enough. If the robot drives straight at the target, it might smash into a table, or its arm might twist into an impossible singularity. To prevent this, MoTo incorporates a comprehensive cost function (\(\mathcal{C}_t\)).

The cost function is a sum of three critical constraints.
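In the notation introduced below, and leaving out any per-term weights the paper may use, this sum reads:

$$
\mathcal{C}_t \;=\; \mathcal{F}^c_t \;+\; \mathcal{F}^s_t \;+\; \mathcal{F}^m_t
$$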

Let’s look at each of these constraints (\(\mathcal{F}\)) individually.
A. Collision Cost (\(\mathcal{F}^c_t\))
Safety is paramount. The robot samples points on its own body surface and calculates the distance to the environmental point cloud (\(P\)). If the distance drops below a safety margin (\(\epsilon_0\)), the cost skyrockets.
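A hinge-style penalty consistent with this description might look like the sketch below, where \(\mathcal{B}_t\) denotes the sampled body-surface points at time \(t\) (the symbol and exact form are my assumptions):

$$
\mathcal{F}^c_t \;=\; \sum_{x \in \mathcal{B}_t} \max\!\Big(0,\; \epsilon_0 - \min_{p \in P} \lVert x - p \rVert\Big)
$$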

B. Smoothness Cost (\(\mathcal{F}^s_t\))
We don’t want the robot to jitter or make erratic movements, which could damage motors or spill items. This cost term penalizes large changes in joint angles or base position between consecutive time steps (\(t\) and \(t+1\)).
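In its simplest form, such a term is a squared difference between consecutive configurations; the paper may weight the base and arm contributions differently, so treat this as a generic sketch:

$$
\mathcal{F}^s_t \;=\; \big\lVert \theta^{base}_{t+1} - \theta^{base}_t \big\rVert^2 \;+\; \big\lVert \theta^{arm}_{t+1} - \theta^{arm}_t \big\rVert^2
$$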

C. Margin Cost (\(\mathcal{F}^m_t\))
This is the “special sauce” for interaction-aware navigation. Even if the robot can reach the object, it shouldn’t reach it with its arm fully extended (locked out) or cramped against its chest. It needs a “manipulability margin”—a sweet spot where the arm can move freely to perform the actual grabbing or pouring task.
The margin cost defines an ideal operating range for the arm’s reach (\(r\)) and penalizes configurations where the arm must extend beyond an upper bound (\(r_{max}\)) or contract below a lower bound (\(r_{min}\)).
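One plausible form of this penalty, with \(r_t\) denoting the arm’s reach (the distance from the arm base to the Arm Keypoint) at time \(t\), is again a pair of hinges (a sketch, not the paper’s verbatim formula):

$$
\mathcal{F}^m_t \;=\; \max(0,\; r_t - r_{max}) \;+\; \max(0,\; r_{min} - r_t)
$$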

The Solver
To solve this optimization problem in real-time, MoTo uses an algorithm called Dual Annealing. This is a global optimization technique that searches for the best trajectory by iteratively “cooling down” the search space, allowing it to escape local minima (like getting stuck in a suboptimal path).
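Dual annealing is available off the shelf in SciPy, so the overall pattern is easy to illustrate. The toy cost below optimizes only a docking pose (x, y, yaw) with a margin-style term and a rough “face the target” term; it is a simplified stand-in for MoTo’s full trajectory cost, not the paper’s implementation.

```python
import numpy as np
from scipy.optimize import dual_annealing

TK = np.array([2.0, 1.0])  # target keypoint, top-down (x, y), for illustration only

def docking_cost(x):
    """Toy stand-in for MoTo's cost: x packs a docking pose (x, y, yaw)."""
    base_xy, yaw = x[:2], x[2]
    reach = np.linalg.norm(TK - base_xy)                      # base-to-target distance
    margin = max(0.0, reach - 0.9) + max(0.0, 0.4 - reach)    # stay in the reachable "sweet spot"
    dx, dy = TK - base_xy
    facing = abs(np.arctan2(dy, dx) - yaw)                    # roughly face the target
    return margin + 0.1 * facing

bounds = [(-3.0, 3.0), (-3.0, 3.0), (-np.pi, np.pi)]          # search space for (x, y, yaw)
result = dual_annealing(docking_cost, bounds, maxiter=200)
print(result.x, result.fun)  # best docking pose found and its cost
```

In MoTo itself, the decision variables would be the full sequence of base and arm configurations, and the cost would combine the collision, smoothness, and margin terms described above.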

Experiments and Results
The theory sounds solid, but does it work? The researchers tested MoTo in both simulation and the real world.
Simulation: The OVMM Benchmark
They used the Open-Vocabulary Mobile Manipulation (OVMM) benchmark, a rigorous test where robots must find objects, pick them up, and place them elsewhere in simulated home environments.

The results in Table 1 are telling. MoTo significantly outperforms the baseline “Home-Robot” methods.
- Home-Robot (RL): 14.8% Overall Success Rate.
- Home-Robot w/ MoTo: 18.32% Overall Success Rate.
- OpenVLA w/ MoTo: 20.64% Overall Success Rate.
While these numbers might seem low (mobile manipulation is hard!), a 2-6% absolute improvement is substantial in this field. Notably, MoTo achieves this without the massive training data required by the other methods.
Real-World Deployment
Simulation is one thing; the real world is another. The authors deployed MoTo on a physical robot with a wheeled base and dual arms. They tasked it with three complex scenarios:
- “Bring me food”: Pick a fruit and plate it.
- “Serve me water”: Get a cup, dispense water, and serve it.
- “Prepare a meal”: Cook an ingredient in a pan and serve it in a bowl.

Figure 3 shows the results. MoTo (represented by the striped bars) consistently achieved higher success rates than standard baselines. The integration with RDT-1B (a diffusion-based foundation model) proved particularly effective.
The visualizations of the robot in action highlight the smoothness of the generated trajectories.

Ablation Studies: What Matters Most?
The researchers also turned off different parts of the system to see what would happen.
- Turning off Fusion (Voting): Performance dropped significantly. If you rely on a single 2D image to guess 3D depth, you will often miss.
- Turning off Margin Cost: The robot would drive to the object but stop at a distance where the arm was fully extended, making the subsequent pick-up impossible.

Limitations and Failures
No system is perfect. The authors candidly discuss where MoTo fails, providing valuable insights for future research.

Common failure modes include:
- Localization Failure: If the SLAM system drifts, the robot thinks it is in front of the table, but it’s actually 10cm to the left.
- Smoothing Failure: Sometimes the trajectory optimization produces a path that is mathematically sound but physically jerky, triggering the robot’s safety stop.
- Optimization Loops: The algorithm can sometimes get stuck oscillating between two solutions.
Additionally, MoTo relies on the assumption that the target object is visible in the initial scan. It doesn’t yet have a robust exploration policy to “search” for hidden objects dynamically.
Conclusion and Future Implications
MoTo represents a significant step forward in generalized mobile manipulation. By decoupling the “where to stand” problem from the “how to grab” problem, and solving the former with interaction-aware optimization, it allows researchers to leverage powerful, pre-trained manipulation models in mobile settings.
The key takeaways are:
- Modularity works: You don’t always need end-to-end training. Smartly plugging together LLMs, VLMs, and optimization algorithms can yield state-of-the-art results.
- Interaction-Awareness is key: Navigation cannot be blind to the needs of the arm. The base and arm must be treated as a unified kinematic chain.
- Zero-Shot is possible: We can achieve complex behaviors without task-specific fine-tuning, which is essential for deploying robots in diverse, unstructured home environments.
As foundation models continue to improve, “plug-ins” like MoTo will likely become the standard way to mobilize these powerful brains, bringing us one step closer to that helpful robot butler we were promised.