Introduction

Eating is one of the most fundamental human activities, an act so intuitive that we rarely give it a second thought. When you sit down to a meal, you don’t calculate the Newton-meters of force required to pierce a piece of broccoli versus a cherry tomato. You don’t consciously analyze the viscosity of mashed potatoes before deciding whether to scoop or skewer them. Yet, for the millions of people living with mobility limitations, the inability to feed themselves is a significant barrier to independence and dignity.

Assistive robotics has long promised a solution: robot arms capable of feeding users. However, “bite acquisition”—the act of picking up food from a plate—is notoriously difficult for machines. Why? Because food is chaotic. It is physically diverse, deformable, and, crucially, it changes over time. A steak that is tender and juicy when hot becomes tough and firm as it cools. A scoop of ice cream melts into a puddle.

Most state-of-the-art robotic feeding systems rely on visual categorization. They see a red, round object, label it “apple,” and execute a pre-programmed “fruit skewering” strategy. But what if that red object is actually a soft, roasted tomato? Or what if the robot is holding a flimsy plastic fork instead of a rigid metal one? The category-based approach fails because it ignores the physical reality of the interaction.

In this post, we will take a deep dive into SAVOR, a research paper from Cornell University and UC San Diego. SAVOR proposes a method that moves beyond simple visual labels: it introduces a system in which robots learn Skill Affordances, understanding not just what the food is, but how it physically reacts to tools and how those tools perform under pressure.

Figure 1: We propose SAVOR, a method that combines tool affordances and food affordances to select the appropriate manipulation skill for robust bite acquisition.

As illustrated in Figure 1, standard approaches fail when food properties change (like the cooling steak). SAVOR, however, uses a combination of offline learning and online “visuo-haptic” perception (seeing and feeling) to dynamically adjust its strategy, achieving a significant jump in success rates for robotic feeding.

Background: The Hierarchy of Affordances

To understand how SAVOR works, we first need to unpack the concept of affordances. In robotics and psychology, an affordance describes the potential for action that an object offers. A handle affords grasping; a chair affords sitting.

The authors of SAVOR argue that successful eating requires reasoning about three distinct layers of affordances:

  1. Food Affordances: What does the food allow? Can this tofu be skewered without crumbling? Is this soup viscous enough to stay on a flat spoon? These are dictated by physical properties like softness, moisture, and viscosity.
  2. Tool Affordances: What can the utensil do? A metal fork might afford piercing hard meat, whereas a plastic fork might snap or bend under the same load. This is a critical distinction often ignored in simulation-based learning.
  3. Skill Affordances: This is the synthesis of the previous two. It captures whether a specific manipulation skill (e.g., “skewering with high force”) is appropriate given both the food’s condition and the tool’s capabilities.
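
To make this hierarchy concrete, here is a minimal sketch of how the three layers might compose in code. The property scales, thresholds, and class names are illustrative assumptions for this post, not the paper's actual representation; the point is simply that a skill is afforded only when the food's state and the tool's capabilities jointly permit it.

```python
# Minimal sketch of the three affordance layers. The fields, scales, and
# thresholds below are illustrative assumptions, not the paper's representation.
from dataclasses import dataclass

@dataclass
class FoodAffordance:
    softness: int    # 1 (hard)  .. 5 (very soft)
    moisture: int    # 1 (dry)   .. 5 (very wet)
    viscosity: int   # 1 (runny) .. 5 (very thick)

@dataclass
class ToolAffordance:
    max_hardness_pierceable: int  # hardest food the utensil can skewer without deforming
    can_scoop: bool

def skill_afforded(skill: str, food: FoodAffordance, tool: ToolAffordance) -> bool:
    """Skill affordance = the synthesis of food and tool affordances."""
    hardness = 6 - food.softness  # invert the softness scale
    if skill == "skewer":
        # The food must be firm enough to hold on the tines,
        # and the tool must survive the pierce.
        return food.softness <= 4 and hardness <= tool.max_hardness_pierceable
    if skill == "scoop":
        # Soft, wet, or viscous foods favor scooping, if the tool allows it.
        return tool.can_scoop and (food.softness >= 3 or food.viscosity >= 3)
    return False

# A plastic fork cannot pierce very hard items, so skewering a raw carrot
# (softness 1) is not afforded, while scooping still is.
plastic_fork = ToolAffordance(max_hardness_pierceable=3, can_scoop=True)
raw_carrot = FoodAffordance(softness=1, moisture=2, viscosity=1)
print(skill_afforded("skewer", raw_carrot, plastic_fork))  # False
print(skill_afforded("scoop", raw_carrot, plastic_fork))   # True
```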

The Problem with Vision-Only Approaches

Previous methods like FLAIR (Feeding via Long-horizon Acquisition of Realistic dishes) rely heavily on Vision-Language Models (VLMs) to identify food categories. If the model sees a banana, it categorizes it as “fruit” and selects a skewering policy.

However, visual data is often deceptive. A piece of fake plastic fruit looks identical to real fruit but requires completely different handling. More commonly, visual data cannot convey hidden properties like hardness or friction. By the time a robot realizes a piece of meat is too tough to pierce, it has usually already failed the attempt.

SAVOR bridges this gap by incorporating haptics—the sense of touch. By analyzing the forces and torques experienced during an attempt, the robot can “feel” the food’s properties and update its understanding of the world.

The SAVOR Framework

The core philosophy of SAVOR is that a robot should behave somewhat like a human diner. We start with a guess based on what we see (“That looks soft”), but we adjust immediately if our fork meets unexpected resistance (“Oh, it’s actually frozen”).

The framework operates in two distinct phases: Pre-Deployment (Offline) and During Deployment (Online).

Figure 2: SAVOR Framework. Before deployment, we perform an offline tool calibration to understand tool affordances. During deployment, we first use a visually-conditioned language model to estimate food physical properties and then refine them through online visuo-haptic perception.

As shown in Figure 2, the system is a closed loop of estimation, action, and refinement. Let’s break down the architecture.

1. Pre-Deployment: Learning the Tools

Before the robot ever sees a dinner plate, it must understand its own body and tools. This is called Offline Tool Calibration.

In this phase, the robot performs random skill executions (skewering, scooping, pushing) on a small set of training foods with known properties (e.g., raw carrots, cheese, nuts). The goal isn’t just to succeed, but to record what happens when it tries.

For example, the system might record:

Tool: Plastic Fork. Action: Skewer. Target: Raw Carrot. Outcome: Failure (Tool deformation).

This creates a dataset that implicitly represents the tool’s affordances. Later, when the robot is planning an action, it can reference this data. If it sees a food item that resembles the hardness of a raw carrot, and it is holding a plastic fork, it will know—based on this calibration history—that skewering is a bad idea.
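
Concretely, you can picture the calibration log as a small table of (tool, skill, food property, outcome) records that the planner later queries. The schema, entries, and lookup function below are a hedged illustration rather than the paper's actual data format.

```python
# Illustrative calibration log (hypothetical schema and entries).
calibration_log = [
    {"tool": "plastic_fork", "skill": "skewer", "softness": 1, "outcome": "fail"},     # raw carrot: tool deformed
    {"tool": "plastic_fork", "skill": "scoop",  "softness": 1, "outcome": "success"},
    {"tool": "metal_fork",   "skill": "skewer", "softness": 1, "outcome": "success"},
    {"tool": "plastic_fork", "skill": "skewer", "softness": 4, "outcome": "success"},  # soft cheese cube
]

def skill_success_rate(tool: str, skill: str, softness: int, tolerance: int = 1) -> float:
    """Estimate how well a (tool, skill) pair worked on foods of similar softness."""
    matches = [r for r in calibration_log
               if r["tool"] == tool and r["skill"] == skill
               and abs(r["softness"] - softness) <= tolerance]
    if not matches:
        return 0.5  # no evidence: stay uncertain
    return sum(r["outcome"] == "success" for r in matches) / len(matches)

# Planning to skewer a hard item (softness 1) with a plastic fork looks unpromising:
print(skill_success_rate("plastic_fork", "skewer", softness=1))  # 0.0
```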

2. Pre-Deployment: Training SAVOR-Net

The second offline component is training the sensory brain of the operation: SAVOR-Net. This is a neural network designed to predict the invisible physical properties of food.

The researchers identified five key physical properties that define how edible items behave:

  • Shape & Size: Observable via vision.
  • Softness, Moisture, Viscosity: Latent properties that require interaction (touch) to fully understand.

Figure 3: (a) Skill library for bite acquisition. (b) SAVOR-Net model architecture.

Figure 3(b) details the SAVOR-Net architecture. It is a multimodal network, meaning it ingests different types of data simultaneously:

  • Image Sequence: RGB frames processed by a ResNet50 encoder.
  • Depth Sequence: Depth maps processed by a CNN.
  • Haptic Sequence: Force and torque readings from a sensor on the robot’s wrist.
  • Pose Sequence: Where the robot arm is in 3D space.

These inputs are fused and passed through an LSTM (Long Short-Term Memory) network, which is excellent at handling time-series data. The network outputs a prediction of the food’s physical properties (e.g., Softness: Level 4/5).
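
The sketch below reconstructs the gist of that architecture in PyTorch. Only the high-level ingredients come from the description above (a ResNet50 image encoder, a CNN for depth, haptic and pose streams, LSTM fusion, per-property outputs); the layer sizes, fusion scheme, and discrete-level heads are my assumptions.

```python
# Sketch of a SAVOR-Net-style multimodal property estimator (assumed dimensions).
import torch
import torch.nn as nn
from torchvision import models

class SavorNetSketch(nn.Module):
    def __init__(self, haptic_dim=6, pose_dim=7, hidden=256, n_levels=5):
        super().__init__()
        resnet = models.resnet50(weights=None)
        resnet.fc = nn.Identity()                       # 2048-d image features
        self.rgb_encoder = resnet
        self.depth_encoder = nn.Sequential(             # small CNN for depth maps
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.haptic_encoder = nn.Linear(haptic_dim, 64)  # force/torque per timestep
        self.pose_encoder = nn.Linear(pose_dim, 64)      # end-effector pose per timestep
        self.lstm = nn.LSTM(2048 + 32 + 64 + 64, hidden, batch_first=True)
        # One classification head per latent property, each over discrete levels.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, n_levels)
            for name in ("softness", "moisture", "viscosity")
        })

    def forward(self, rgb, depth, haptic, pose):
        # rgb: (B, T, 3, H, W), depth: (B, T, 1, H, W), haptic: (B, T, 6), pose: (B, T, 7)
        B, T = rgb.shape[:2]
        f_rgb = self.rgb_encoder(rgb.flatten(0, 1)).view(B, T, -1)
        f_depth = self.depth_encoder(depth.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([f_rgb, f_depth,
                           self.haptic_encoder(haptic),
                           self.pose_encoder(pose)], dim=-1)
        _, (h_n, _) = self.lstm(fused)                   # summarize the interaction over time
        return {name: head(h_n[-1]) for name, head in self.heads.items()}
```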

3. Deployment: The Loop of Inference

Once the robot is placed in front of a user with a meal, the online phase begins. This is modeled as a Partially Observable Markov Decision Process (POMDP)—a mathematical way of saying “we need to make decisions, but we don’t have all the facts.”
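
For readers who want the formalism, the standard POMDP tuple is written below, with an informal mapping to SAVOR's setting based on this post's description (the paper's exact notation may differ).

```latex
% Standard POMDP tuple; the mapping to SAVOR's setting is an informal paraphrase.
\[
\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R \rangle
\]
\begin{align*}
\mathcal{S} &\;:\; \text{latent food state (softness, moisture, viscosity, \dots)} \\
\mathcal{A} &\;:\; \text{the skill library (skewer, scoop, push, \dots)} \\
\Omega &\;:\; \text{observations: RGB-D images plus force/torque readings} \\
T(s' \mid s, a) &\;:\; \text{how the food changes when a skill is applied} \\
O(o \mid s', a) &\;:\; \text{what the robot sees and feels after acting} \\
R(s, a) &\;:\; \text{reward, e.g. a successful bite acquisition}
\end{align*}
```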

Step A: Initial Commonsense Reasoning

When the robot first looks at the plate, it hasn’t touched anything yet. It captures an RGB-D image and sends it to a large Vision-Language Model (GPT-4V).

The system prompts the VLM with context. For example:

“You see a plate with steak. Based on commonsense, estimate the softness, moisture, and viscosity.”

The VLM might return a “Softness” score of 3 (moderately soft). Based on this guess, and the tool calibration data, the robot selects a skill—perhaps “Skewer.”
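
A minimal sketch of this initial query might look like the following. The prompt wording, the query_vlm helper, and the JSON answer format are all hypothetical stand-ins for whatever interface the authors wrap around GPT-4V.

```python
# Hypothetical prompt construction and parsing for the initial, vision-only
# property estimate. `query_vlm` is a stand-in for a real GPT-4V call.
import json

PROMPT_TEMPLATE = (
    "You see a plate containing: {items}. "
    "Using commonsense knowledge only (you have not touched the food yet), "
    "rate each item's softness, moisture, and viscosity on a 1-5 scale. "
    'Respond as JSON, e.g. {{"steak": {{"softness": 3, "moisture": 3, "viscosity": 1}}}}.'
)

def initial_property_estimate(items, rgb_image, query_vlm):
    """Ask the VLM for a commonsense prior over food properties, before any contact."""
    prompt = PROMPT_TEMPLATE.format(items=", ".join(items))
    raw = query_vlm(prompt=prompt, image=rgb_image)  # hypothetical VLM call
    return json.loads(raw)

# Example with a stubbed-out VLM:
fake_vlm = lambda prompt, image: '{"steak": {"softness": 3, "moisture": 3, "viscosity": 1}}'
print(initial_property_estimate(["steak"], rgb_image=None, query_vlm=fake_vlm))
# -> {'steak': {'softness': 3, 'moisture': 3, 'viscosity': 1}}
```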

Step B: Action and Refinement

The robot attempts the action. This is where SAVOR shines. As the fork makes contact, the robot records the haptic feedback.

If the skewer attempt is successful, great! The robot feeds the user. If it fails (e.g., the fork bounces off), the robot doesn’t just give up. It feeds the video and force data from that failed attempt into SAVOR-Net.

SAVOR-Net analyzes the collision. It notices the high force spike and zero penetration depth. It updates the state estimate: “Correction: Softness is actually 1 (Hard).”
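
In code, the "bounced off" signature that triggers such a correction could be detected roughly as follows. The scalar features and thresholds are illustrative; SAVOR-Net itself learns from the raw visual and haptic sequences rather than hand-crafted rules.

```python
# Illustrative detection of the "bounced off" signature: a force spike with
# almost no penetration. The thresholds are assumptions for this sketch.
import numpy as np

def looks_like_hard_food(forces_z: np.ndarray, penetration_mm: float,
                         force_spike_n: float = 8.0, min_penetration_mm: float = 2.0) -> bool:
    """Return True if the haptic trace suggests the food is much harder than expected."""
    peak_force = float(np.max(np.abs(forces_z)))
    return peak_force > force_spike_n and penetration_mm < min_penetration_mm

# Cooled steak: ~12 N spike along the tool axis, fork barely enters the surface.
forces = np.array([0.2, 0.5, 3.1, 12.4, 11.8, 1.0])
if looks_like_hard_food(forces, penetration_mm=0.5):
    food_estimate = {"softness": 1}  # downgrade the softness estimate
```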

Step C: Replanning

With the updated state (Food is Hard), the robot queries the VLM planner again.

“The food is hard. The tool is a plastic fork. Skewering failed. What should I do?”

The planner, referencing the calibration data that shows plastic forks fail to skewer hard items, switches the strategy to “Scoop.”
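
Putting the pieces together, the online phase reads roughly like the sketch below, where every callable is a placeholder for a component described above rather than an API from the paper's codebase.

```python
# Sketch of the online estimate -> act -> refine loop. All callables are
# placeholders for the components described in Steps A-C.
def acquire_bite(item, tool, initial_estimate, select_skill, execute, refine,
                 max_attempts=3):
    properties = initial_estimate(item)                   # Step A: commonsense prior from the VLM
    for _ in range(max_attempts):
        skill = select_skill(item, properties, tool)      # planner consults tool calibration data
        outcome, observation = execute(skill, item)       # Step B: act, record video + force/torque
        if outcome == "success":
            return skill
        # A failed attempt is information: feed it to SAVOR-Net...
        properties = refine(observation, prior=properties)
        # ...and replan with the corrected properties on the next pass (Step C).
    return None
```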

Figure 6: Qualitative results on bite acquisition. The robot first attempts to skewer the food based on its initial property estimate but fails (step 2). Vision and haptic data from this attempt are processed by SAVOR-Net, refining the estimate with high confidence. A VLM planner then selects the scoop skill based on this update (step 3).

Figure 6 provides a perfect visual of this loop. In the top row, the robot tries to skewer (step 2) and fails. SAVOR-Net analyzes the forces (Event #2) and updates the softness score. In step 3, the planner pivots to a scoop, which succeeds.

Experiments and Results

To validate SAVOR, the researchers set up a comprehensive evaluation using a Kinova Gen3 robot arm. They tested the system on 20 different food items across 10 in-the-wild dishes. These weren’t just plastic props; they used real food ranging from salads and fruits to steaks and curries.

Figure 4: Experimental setup: 10 in-the-wild dishes. * denotes food items unseen during training.

Table 1: Quantitative results on bite acquisition.

As seen in Figure 4, the plates were diverse and cluttered, mimicking real-world dining scenarios where foods touch and overlap.

Does Calibration Matter?

The first major finding was the impact of tool calibration. The researchers tested the system with a robust metal fork and a flimsy plastic fork.

Without calibration, the system treated both tools the same. It would try to skewer a hard steak with the plastic fork, leading to repeated failures. With calibration enabled, the system understood the plastic fork’s limitations and defaulted to scooping hard items, even if skewering would have been the preferred method for a metal fork.

Vision vs. Haptics

The researchers compared SAVOR against several baselines:

  • Vision-only: Uses only the camera to guess properties.
  • Haptic-only: Uses only touch (blind interaction).
  • FLAIR: The state-of-the-art category-based method.

The results were stark. SAVOR achieved an 87.3% success rate within 3 attempts, significantly outperforming FLAIR (73.4%) and Vision-only methods (77.2%).

The Vision-only baseline struggled with “visual imposters.” For example, on Plate 1, strawberries, watermelon, and carrots all appear red and somewhat similar in shape. A vision system might mistake a chunk of carrot for watermelon. Since watermelon is soft, the robot tries to skewer the carrot, which is hard, resulting in failure. SAVOR, by incorporating haptics, immediately detects the hardness upon contact and corrects the strategy.

Generalization to Unseen Food

A critical test for any robotic system is how it handles things it hasn’t seen before. The researchers tested SAVOR on food items that were not in its training set.

Figure 7: Generalization performance on seen and unseen food items. We compare SAVOR and SOTA FLAIR across 20 food items.

Figure 7 shows that SAVOR (orange bars) consistently matches or outperforms the FLAIR baseline (blue bars), especially on difficult unseen items like cake or avocados.

The reason for this success lies in the physical grounding of the method. Even if the robot has never seen a specific type of cake before, the moment it touches it, SAVOR-Net recognizes the physical signature of “soft, porous, low moisture.” It doesn’t need to know the name “sponge cake” to know how to manipulate it.

The Physics of Food

To truly appreciate why SAVOR works, we have to look at the physics. Figure 8 illustrates the specific properties SAVOR-Net is trained to detect.

Figure 8: Effect of food physical properties on utensil interactions. The robot skewers food items of varying softness (top) and viscosity (bottom).

  • Softness: In the top row, we see the robot interacting with tofu. Firm tofu allows for skewering, but extremely soft tofu might crumble (or require very delicate skewering).
  • Viscosity: In the bottom row, the robot deals with mashed potatoes. Low-viscosity potatoes can be scooped easily. High-viscosity (very sticky/thick) potatoes might stick to the plate or the spoon in unexpected ways, requiring a “Twirl” or stronger “Scoop” action.

By explicitly modeling these properties, SAVOR moves robotic reasoning from semantic labels (“It is a potato”) to functional dynamics (“It is sticky and soft”).

Conclusion & Implications

SAVOR represents a significant step forward in assistive robotics. It highlights a limitation in the current trend of relying solely on massive Vision-Language Models. While VLMs are incredibly smart at identifying objects, they lack the physical intuition that comes from interaction. You cannot prompt ChatGPT to “feel” how hard a steak is.

The key takeaways from this research are:

  1. Affordances are Dynamic: A food item’s “scoop-ability” or “skewer-ability” changes based on its temperature, ripeness, and the tool being used.
  2. Multimodal Perception is Essential: Vision provides the roadmap, but haptics provides the terrain. Robots need both to navigate the complex landscape of a dinner plate.
  3. Failure is Information: In the SAVOR framework, a failed attempt isn’t a dead end; it’s a high-fidelity data point that makes the next attempt more likely to succeed.

For the millions of individuals who rely on caregivers for feeding, systems like SAVOR offer a glimpse of a future where robots are not just pre-programmed machines, but adaptive partners capable of handling the messy, variable, and human reality of a meal.

Future work aims to close the loop even tighter. Currently, SAVOR treats the food item as a single uniform object. However, a piece of broccoli has a soft floret and a hard stem. Integrating real-time, closed-loop control that can adjust the robot’s grip during the motion (millisecond by millisecond) could further refine the delicate art of robotic dining.