Breaking Robots with Geometry: How to Red-Team Manipulation Policies

Imagine you have trained a robot to pick up a screwdriver. After thousands of simulated trials it achieves a 95% success rate, and you are ready to deploy. But then, in the real world, you hand the robot a screwdriver that is slightly bent, or whose handle is a bit thicker than anything in the training set. Suddenly, the robot fails catastrophically: it slips, drops the object, or cannot find a grip.

This is a classic problem in robotics: brittleness to out-of-distribution geometry. Standard benchmarks evaluate robots on curated, “nominal” object sets. They rarely test how the system handles the messy, imperfect variations found in reality.

In this post, we are diving deep into a new framework called Geometric Red-Teaming (GRT). This research proposes a way to automatically discover “CrashShapes”—physically plausible, deformed versions of objects that cause pre-trained robot policies to fail. By treating the policy as a black box and using simulation-in-the-loop optimization, GRT exposes the hidden vulnerabilities of robotic manipulation systems.

Figure 1: GRT surfaces policy failures on a real robot from minimal, plausible geometry edits. Top: the nominal screwdriver, bottle, and USB plug succeed. Bottom: CrashShapes induce a bad grasp pose, grasp slippage, and insertion failure via in-gripper plug rotation at socket contact. Small, realistic deformations collapse policies that succeed on the original object.

As shown in Figure 1, the system takes standard objects (top row) and discovers subtle geometric changes (bottom row) that lead to bad grasps, slippage, or insertion failures—even when the deformation looks minor to a human observer.

The Problem: Static Benchmarks vs. Dynamic Reality

In fields like Computer Vision and Natural Language Processing (NLP), “Red-Teaming” is a standard practice. Researchers actively try to break their models using adversarial examples—images with imperceptible noise that trick a classifier, or prompts that bypass an LLM’s safety filters.

Robotics lacks a robust equivalent for 3D geometry. Most evaluation happens on static datasets like YCB (a standard set of everyday objects). If a robot can pick up the YCB mustard bottle, we assume it can pick up any mustard bottle. This assumption is dangerous. Geometric variations alter affordances—the specific parts of an object that allow for interaction (like a handle or a rim). If a grasp policy relies on a specific curvature that disappears with a slight dent, the policy is fragile.

GRT aims to answer the question: Can we automatically generate geometric deformations that induce catastrophic failure, while keeping the object physically plausible?

The Solution: Geometric Red-Teaming (GRT)

GRT is a modular framework that integrates three distinct concepts:

  1. VLM-Guided Selection: Using Vision-Language Models (like GPT-4o) to decide where to deform an object based on semantic reasoning.
  2. Jacobian Field Deformation: A mathematical method to deform the mesh smoothly and realistically.
  3. Black-Box Optimization: A genetic-style algorithm that evolves these shapes inside a physics simulator to minimize the robot’s success rate.

Figure 2: System overview of GRT. Given a task description and nominal object (Initialization Parameters), anchor and handle points are selected using a vision-language model (a). Handle displacements are sampled to define a population of deformation candidates. Each sample is converted into a perturbed mesh via Jacobian field-based optimization (b) and evaluated in simulation with a frozen policy (c). Deformations that induce failure are sampled to guide the next population.

The workflow, illustrated in Figure 2, is cyclical. It starts with a nominal object, identifies critical points to manipulate, generates a population of deformed “candidates,” tests them in a simulator (Isaac Gym), and evolves the population toward failure.

Step 1: Where to Deform? (VLM Guidance)

You cannot simply move vertices of a 3D mesh at random. Doing so would result in spiky, jagged, or non-manifold meshes that look like glitches rather than real objects. Furthermore, not all parts of an object matter for a specific task. If you are testing a robot’s ability to insert a USB drive, deforming the plastic casing might not matter, but slightly bending the connector head is critical.

To solve this, the researchers employ a Vision-Language Model (VLM). They developed a two-stage prompting strategy.

  1. Geometric Reasoning: The VLM is shown multiple views of the object with numbered keypoints overlaid. It is asked to identify which points can serve as “handles” (points to move) and “anchors” (points to keep fixed) to create meaningful shape variations.
  2. Task-Critical Ranking: The VLM ranks these subsets based on the specific task (e.g., “red-teaming a grasping policy”). It looks for changes that are plausible but likely to cause trouble.

Figure 3: Two-stage VLM prompting strategy for 3D handle-point selection. First, the Geometric Reasoning template aligns a canonical view panel and indexed keypoints with a high-level task description, guiding the VLM to infer which vertices control meaningful mesh deformations. Next, the Task-Critical Ranking template asks the model to Pareto-rank these candidates by plausibility and task relevance, producing a compact set of handle points for targeted, task-aware red-teaming.

This semantic grounding ensures the optimization search space focuses on the parts of the object that actually matter, making the process much more efficient than random searching.
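To make the two-stage strategy concrete, here is a minimal sketch in Python. The prompt templates and the reply format are hypothetical stand-ins (the paper's exact prompts are not reproduced here); only the overall structure — a geometric-reasoning query followed by a task-critical ranking query, plus a parser for the model's handle/anchor answer — mirrors the description above.

```python
import re

# Stage 1 (hypothetical template): geometric reasoning over a panel of
# views with numbered keypoints overlaid on the object.
def geometric_reasoning_prompt(object_name: str, num_keypoints: int) -> str:
    return (
        f"You are shown multiple views of a {object_name} with keypoints "
        f"numbered 0..{num_keypoints - 1}. List which keypoints could serve "
        "as handles (points to move) and which as anchors (points to keep "
        "fixed) to create meaningful shape variations.\n"
        "Answer as: handles: <ids>; anchors: <ids>"
    )

# Stage 2 (hypothetical template): rank the stage-1 candidates by
# plausibility and relevance to the specific task being red-teamed.
def task_ranking_prompt(task: str, handle_sets: list) -> str:
    options = "; ".join(f"({i}) handles={h}" for i, h in enumerate(handle_sets))
    return (
        f"Task: {task}. Rank the following handle sets by physical "
        f"plausibility and likelihood of causing policy failure: {options}"
    )

# Parse a reply of the assumed form "handles: 3, 7; anchors: 0, 1".
def parse_handle_anchor_reply(reply: str):
    handle_part, anchor_part = reply.split(";")[:2]
    handles = [int(x) for x in re.findall(r"\d+", handle_part)]
    anchors = [int(x) for x in re.findall(r"\d+", anchor_part)]
    return handles, anchors

handles, anchors = parse_handle_anchor_reply("handles: 3, 7; anchors: 0, 1")
print(handles, anchors)  # [3, 7] [0, 1]
```

In practice the prompts would be sent, together with the rendered view panel, to a VLM such as GPT-4o; the parsed handle/anchor indices then define the search space for the deformation stage.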

Step 2: How to Deform? (Jacobian Fields)

Once the “handle” points are selected, the system needs a way to move them while dragging the rest of the mesh along smoothly. The researchers adapted a technique called As-Plausible-As-Possible (APAP), specifically its Jacobian field deformation stage.

The mathematical goal is to find new vertex positions (\(V^*\)) that minimize distortion in the local geometry (preserving the original triangles’ orientation and scale as much as possible) while satisfying the constraints of the handle and anchor points.

\[
\boldsymbol{V}^{*} = \underset{\boldsymbol{V}}{\arg\min}\; \left\| \boldsymbol{L}\boldsymbol{V} - \boldsymbol{\nabla}^{T} \boldsymbol{A} \boldsymbol{J} \right\|^{2} + \lambda \left\| \boldsymbol{K}_{a} \boldsymbol{V} - \boldsymbol{T}_{a} \right\|^{2}
\]

In this equation:

  • \(L\) represents the Laplacian (describing local mesh connectivity).
  • \(J\) is the Jacobian field (the local rotation/scale transformations).
  • The second term ensures that anchor points (\(T_a\)) stay where they are supposed to be.
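The objective above is a penalized least-squares problem, which can be illustrated on a toy example. The sketch below uses a simple graph Laplacian on a 5-vertex chain as a stand-in for the mesh Laplacian, and sets the target-Jacobian term to the rest shape (i.e. "no distortion"); the anchor and handle constraints enter through the \(\lambda\)-weighted penalty. This is a didactic simplification, not the paper's implementation, which operates on full triangle meshes.

```python
import numpy as np

# Toy stand-in for the Jacobian-field solve: a 5-vertex chain in 2D.
# b = L @ V0 encodes "target local geometry = the rest shape".
V0 = np.array([[0.0, 0.0], [1, 0], [2, 0], [3, 0], [4, 0]])
n = len(V0)
L = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:   # chain edges
    L[i, i] += 1; L[j, j] += 1; L[i, j] -= 1; L[j, i] -= 1
b = L @ V0

# Constraints: vertex 0 is an anchor (stays put), vertex 4 is a handle
# displaced upward by 1. Both enter as the penalty lam * ||K V - T||^2.
K = np.zeros((2, n)); K[0, 0] = 1.0; K[1, 4] = 1.0
T = np.array([V0[0], V0[4] + np.array([0.0, 1.0])])
lam = 1e3

# Stack the Laplacian term and the scaled constraint rows into a single
# least-squares system and solve for the deformed vertex positions V*.
A = np.vstack([L, np.sqrt(lam) * K])
rhs = np.vstack([b, np.sqrt(lam) * T])
V_star, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(np.round(V_star, 2))
```

The anchor vertex stays (approximately) fixed, the handle lands near its target, and the interior vertices bend smoothly between them — exactly the "drag the rest of the mesh along" behavior the Jacobian-field stage provides.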

Interestingly, the researchers found that the full APAP pipeline, which includes a “diffusion prior” to make shapes look like a learned distribution, was actually harmful for certain engineering objects.

Figure 6: Deformation failure induced by the APAP diffusion prior on a USB plug. While the Jacobian-only variant preserves connector geometry, the full pipeline produces unrealistic deformations. These deviations significantly undermine task viability for insertion.

As seen in Figure 6, applying the full diffusion prior to a USB plug (middle column) destroyed the connector geometry, making it impossible to insert regardless of the robot’s skill. The Jacobian-only method (right column) preserved the structural integrity of the connector while allowing for the necessary deformation.

Furthermore, omitting the diffusion prior offered a massive speedup—reducing processing time from 10 minutes per object to just 22 seconds, which is crucial when running thousands of optimization loops.

Step 3: Finding the Failure (Optimization)

With a method to deform objects, the system now needs to find the specific deformation parameters \(\theta\) (the movement vectors of the handle points) that minimize the robot’s performance \(\mathcal{J}\).

\[
\theta^{*} = \underset{\theta \in \Theta,\; D_{\theta}(M) \in \mathcal{G}(M)}{\arg\min}\; \mathcal{J}\big(\pi, D_{\theta}(M)\big)
\]

Because the simulator (Isaac Gym) and the policy success metric are generally non-differentiable (you can’t easily calculate a gradient), standard gradient descent won’t work. Instead, GRT uses a population-based, gradient-free approach called TOPDM.

Algorithm 1: Red-Teaming Black-Box Manipulation Policies via Simulator Feedback

As outlined in Algorithm 1, the process works as follows:

  1. Initialize a population of random deformations.
  2. Evaluate every candidate in the simulator (rollout).
  3. Select Elites: Pick the top percentage of deformations that caused the lowest success rates.
  4. Mutate: Create the next generation by slightly perturbing the elites.
  5. Repeat until a catastrophic failure is found or time runs out.
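The five steps above can be sketched as a compact population loop. In the snippet below, the simulator rollout is replaced by a mock `success_rate` function over a 2-D parameter space (the real objective in GRT is a batch of Isaac Gym rollouts), and the selection/mutation scheme is a generic elite-based search rather than the exact TOPDM update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock stand-in for a simulator rollout: maps deformation parameters
# theta (handle displacements) to a policy success rate in [0, 1].
def success_rate(theta):
    # Pretend the policy is brittle near theta = (0.3, -0.2).
    return float(np.clip(np.linalg.norm(theta - np.array([0.3, -0.2])), 0, 1))

def red_team(pop_size=32, elite_frac=0.25, sigma=0.05, iters=30):
    pop = rng.normal(0.0, 0.1, size=(pop_size, 2))             # 1. initialize
    for _ in range(iters):
        scores = np.array([success_rate(t) for t in pop])      # 2. rollouts
        n_elite = int(elite_frac * pop_size)
        elites = pop[np.argsort(scores)[:n_elite]]             # 3. lowest success
        parents = elites[rng.integers(0, n_elite, pop_size)]
        pop = parents + rng.normal(0.0, sigma, pop.shape)      # 4. mutate
    scores = np.array([success_rate(t) for t in pop])          # 5. best found
    return pop[np.argmin(scores)], scores.min()

theta_star, worst = red_team()
print(theta_star, worst)
```

Because only function evaluations are needed, the same loop works for any black-box policy and any non-differentiable success metric — the property that makes gradient-free search the right fit here.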

To ensure the deformations don’t become ridiculous (like turning a mug into a flat pancake), the researchers introduced a Smoothness Score (SS) constraint.

\[
\mathrm{SS}(D) = \frac{1}{M} \sum_{i=1}^{M} \left\| d_{i} \right\|_{2}, \qquad D = \{ d_{i} \}_{i=1}^{M}.
\]

This score limits the average displacement of the handle points. The optimizer filters out any candidate that exceeds a specific “deformation budget” \(\tau\):

\[
\mathcal{C}_{\tau}(M) = \left\{ \theta \in \Theta : \mathrm{SS}\big( D(\theta) \big) \leq \tau \right\}.
\]
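The score and the budget filter are both one-liners; a minimal sketch (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

# Smoothness Score: mean Euclidean norm of the handle displacements d_i.
def smoothness_score(D: np.ndarray) -> float:
    return float(np.mean(np.linalg.norm(D, axis=1)))

# Keep only candidates whose deformation stays within the budget tau.
def within_budget(candidates, tau):
    return [D for D in candidates if smoothness_score(D) <= tau]

D = np.array([[3.0, 4.0], [0.0, 0.0]])   # two handle displacements
print(smoothness_score(D))               # (5 + 0) / 2 = 2.5
```

A candidate with one handle moved by 5 units and one left in place has a score of 2.5, so it survives a budget of \(\tau = 3\) but is filtered out under \(\tau = 2\).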

Experimental Results: The Collapse

The researchers tested GRT across three distinct domains:

  1. Rigid Grasping: Picking up YCB objects using Contact-GraspNet.
  2. High-Precision Insertion: Inserting a USB-like plug into a socket.
  3. Articulated Manipulation: Opening a drawer.

The results were stark. Policies that performed near-perfectly on nominal objects crumbled under GRT’s discovered shapes.

Table 1: Red-teaming results across tasks. Final drop, iteration to failure, and AUC measure failure severity; \(\Delta\)Comp. quantifies geometric deviation.

In Table 1, “Final Drop” indicates the reduction in success rate.

  • Grasping: Dropped by ~76%.
  • Articulated Manipulation: Dropped by ~61-98%.
  • Insertion: Dropped by ~67-77%.

The visual evolution of these failures is fascinating. The optimization process slowly morphs the object, hunting for the policy’s blind spot.

Figure 4: Evolution of geometric red-teaming across optimization. Each row shows an object undergoing deformation via our pipeline across three tasks: rigid grasping (rows 1-4), high-precision insertion (row 5), and articulated drawer manipulation (row 6). Columns show deformation stages with annotated shape complexity and task success. Results confirm that minor, plausible deformations can collapse performance, often without significant increase in complexity.

Look at the L-bracket in the bottom row of Figure 4. The change is subtle, yet the success rate drops from 97.4% to 11.4%. This highlights how “brittle” learned policies can be; they overfit to specific geometric features of the training object.

Does VLM Guidance Matter?

You might wonder if we really need a fancy VLM to choose the handle points. Couldn’t we just pick random points? The researchers performed an ablation study to test this.

Table 2: Ablation results on grasping with Contact-GraspNet across 22 YCB objects. We evaluate the impact of handle selection strategy (Heuristic vs. VLM-guided) and deformation search method (Gaussian Perturbation vs. Optimization). All keypoint-based methods (except “All Handles”) use a fixed handle count matched to the VLM-guided mean. Results show that both VLM guidance and optimization improve failure severity and convergence.

Table 2 compares VLM-Guided + Optimization (the proposed method) against heuristic (random) selection and simple Gaussian perturbation.

  • VLM guidance achieved the highest drop in performance (76.3%).
  • It reached 50% failure faster (7.32 iterations) than heuristics.
  • It kept the geometric complexity lower (\(\Delta\) Complexity 0.041), meaning the shapes were simpler and more realistic, yet more effective at breaking the robot.

Blue-Teaming: Fixing the Robot

The goal of Red-Teaming isn’t just to break things—it’s to make them stronger. This is where Blue-Teaming comes in.

The researchers took the “CrashShapes” discovered by GRT and fed them back into the training pipeline. They fine-tuned the policies using PPO (Proximal Policy Optimization) on these difficult geometries.

Table 3: Simulation blue-teaming results on high-precision industrial insertion. CrashShape performance is reported before and after fine-tuning; the final column confirms nominal performance is preserved. Nominal pre-training success: \(96\%\) (State-based) and \(86\%\) (PointCloud-initialized).

The results in Table 3 are encouraging.

  • For the State-based insertion policy, success on “CrashShape 1” (CS-1) jumped from 25.0% to 87.8%.
  • Crucially, the performance on the Nominal (original) object remained high (87.5%).

This proves that CrashShapes are valid training signals. They aren’t “adversarial examples” in the sense of being impossible nonsense; they are valid, hard-mode examples that force the policy to generalize better.

From Simulation to Reality

A common critique of simulation-based research is the “Sim-to-Real gap.” Do these subtle geometric failures actually matter in the real world, or are they just exploiting physics engine bugs?

To verify this, the team 3D-printed the CrashShapes discovered in the simulator and tested them on physical robots (xArm 6 for insertion, Franka Emika Panda for grasping).

Figure 7: Physical setup and fabricated geometries used for real-world insertion experiments. Left: Nominal USB plug and two red-teamed CrashShapes generated by our framework. These 3D-printed variants retain connector plausibility while introducing subtle geometric deviations. Right: xArm 6 robot and assembly platform used for physical testing.

Figure 8: Physical setup and fabricated geometries used for real-world grasping experiments. Left: Nominal screwdriver and bottle and their CrashShapes (deformed variants). Right: Table-top Franka Emika Panda with an Azure Kinect camera used to acquire point clouds. These 3D-printed variants preserve plausibility while altering geometry relevant to grasping.

The real-world results mirrored the simulation closely.

Table 4: Real-world validation across insertion and grasping. Columns are uniform for both tasks. For insertion, CS-1 and CS-2 are the two printed CrashShapes. For grasping, each object has a single printed CrashShape reported under CS-1; CS-2 is “–”.

As shown in Table 4:

  • Insertion: The original policy succeeded 90% of the time on the nominal plug. On CS-1, it plummeted to 22.5%.
  • Recovery: When they deployed the “Blue-Teamed” policy (fine-tuned in sim), the real-world success on CS-1 recovered to 90.0%.

This is a powerful validation. It confirms that GRT is discovering physical, geometric vulnerabilities that transfer to reality, and that simulation-based correction effectively repairs these vulnerabilities in the real world.

Conclusion and Implications

Geometric Red-Teaming (GRT) introduces a rigorous way to stress-test robotic manipulation. Instead of relying on static test sets that give us a false sense of security, GRT proactively hunts for the geometric “edge cases” that cause failure.

Key Takeaways:

  1. Geometry is a vector of failure: Small changes in shape can completely break policies that seem robust.
  2. Semantic Guidance is efficient: Using VLMs to guide the deformation search finds failures faster and yields more plausible shapes than random noise.
  3. Actionable feedback: The discovered CrashShapes are not just for evaluation; they are valuable training data that significantly improve real-world robustness.

As robots move out of controlled factory environments and into unstructured homes and offices, tools like GRT will be essential. We cannot manually curate every possible bent spoon or dented can a robot might encounter. We need automated adversaries to find these failures for us, so we can fix them before deployment.