Bring My Cup! ☕
Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Pohang University of Science and Technology (POSTECH)

Imagine asking for your favorite coffee mug while getting ready for work, or having a robot fetch your pet’s specific toy from a pile of similar objects. To be truly useful in daily life, robots must discern the subtle details that distinguish "a cup" from "my cup."

"Put my camera on towel"

"Pick my pen holder"

"Select my leather bag"

"Pick my shaver"

VAP enables frozen VLA models to manipulate user-specific objects among visually similar distractors.

From "A Cup" to "MY Cup"

Standard Vision-Language-Action (VLA) models are excellent at understanding generic concepts like "pick up the cup." However, they fail when users ask for specific instances: "Bring my cup." To a generic VLA, your favorite mug is just "a cup" among many others.

Visual Attentive Prompting (VAP) bridges this gap between semantic understanding and instance-level control. By visually highlighting the target object in the robot's view, VAP lets existing VLAs manipulate unseen personal objects immediately: no training, no fine-tuning, just prompting.

Abstract

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

Method: Visual Attentive Prompting

VAP Pipeline

VAP is a training-free framework that enables frozen VLA models to understand personalized references. Given a few reference images of a user's object (e.g., "my toy owl"), VAP first grounds the object in the current scene using an open-vocabulary detector and a retriever. It then generates a Visual Prompt by overlaying a semi-transparent red tint on the target object and rewrites the text instruction (e.g., "my toy owl" → "red toy owl"). This explicit visual cue guides the VLA's attention, allowing it to manipulate specific instances among visually similar distractors without any fine-tuning.
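To make the pipeline concrete, below is a minimal Python sketch of a VAP-style perceptual adapter. It is an illustration under assumptions, not the authors' implementation: `detect_boxes` is a hypothetical stand-in for any open-vocabulary detector, CLIP image embeddings play the role of the retriever over the reference images, and the red tint is applied to the detected bounding box rather than a segmentation mask.

```python
# Minimal sketch of a VAP-style pipeline (illustrative, not the authors' code).
# Assumptions: `detect_boxes` is a placeholder for an open-vocabulary detector
# returning (x0, y0, x1, y1) boxes for a category prompt; CLIP embeddings serve
# as the non-parametric visual memory over the user's reference images.
from typing import Callable, List, Tuple
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images: List[Image.Image]) -> torch.Tensor:
    """L2-normalized CLIP image embeddings."""
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def ground_personal_object(
    scene: Image.Image,
    references: List[Image.Image],   # ~5 images of the user's object
    category: str,                   # e.g., "toy owl"
    detect_boxes: Callable[[Image.Image, str], List[Tuple[int, int, int, int]]],
) -> Tuple[int, int, int, int]:
    """Return the detected box whose crop best matches the reference memory."""
    boxes = detect_boxes(scene, category)
    crops = [scene.crop(b) for b in boxes]
    ref_feats = embed(references)    # non-parametric visual memory
    crop_feats = embed(crops)
    # Mean cosine similarity of each candidate crop to the reference set.
    scores = (crop_feats @ ref_feats.T).mean(dim=1)
    return boxes[scores.argmax().item()]

def visual_prompt(scene: Image.Image, box, alpha: int = 96) -> Image.Image:
    """Overlay a semi-transparent red tint on the grounded object."""
    overlay = Image.new("RGBA", scene.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=(255, 0, 0, alpha))
    return Image.alpha_composite(scene.convert("RGBA"), overlay).convert("RGB")

def rewrite_instruction(instruction: str, category: str) -> str:
    """Illustrative rewrite rule: 'put my toy owl ...' -> 'put red toy owl ...'."""
    return instruction.replace(f"my {category}", f"red {category}")
```

The prompted image and rewritten instruction are then passed to the frozen VLA policy in place of the raw observation and command, so the policy itself never needs to be aware of the personalization step.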

Benchmarks Overview

We establish three benchmarks to rigorously evaluate instance-level manipulation capabilities. In all settings, the robot must identify a user-specific target (defined by ~5 reference images) among visually similar distractors of the same category.

  • Personalized-SIMPLER: Adapted from the SIMPLER simulation benchmark. We evaluate on both Google Robot (Fractal) and WidowX (Bridge) settings, replacing task-relevant objects with high-fidelity 3D assets from Sketchfab.
  • Personalized-VLABench: Extends VLABench with multi-view selection tasks on a Franka Emika Panda, requiring the policy to handle occlusion and remain consistent across 3 camera views.
  • Real-world Benchmark: A physical tabletop setup with a SO-101 arm. Includes 8 everyday categories (e.g., vase, slipper, plushie) with unseen instances collected from the real world.

Qualitative Results

We evaluate VAP in three challenging simulation settings. The videos below demonstrate how VAP grounds the target instance and guides the frozen VLA policy to complete each task.

1. Personalized-SIMPLER (WidowX)

"Put my camera on towel"

"Put my owl figurine on plate"

"Pick my shaver"

"Put my straw cup in basket"

2. Personalized-SIMPLER (Google Robot)

"Pick my pen holder" (Visual Matching)

"Pick my pen holder" (Visual Variant)

"Move my bottle near coke can" (Visual Matching)

"Move my bottle near coke can" (Visual Variant)

3. Personalized-VLABench

"Select my leather bag"

"Select my cat figurine"

"Select my cup"

"Select my miniature house"

"Select my shoe"

Quantitative Analysis

WidowX (Bridge)

Table 3 Results

On the WidowX arm using the Bridge dataset, VAP achieves high success rates across all tasks, effectively grounding the user's object even when visually similar distractors are present.

Google Robot (Fractal)

Table 2 Results

VAP significantly outperforms generic policies and text-based prompts. Notably, in the challenging Visual Variant track, VAP maintains a strong success rate of 58.2%.

Personalized-VLABench

Table 4 Results

VAP achieves the highest Success Rate (SR) across all 5 object categories in multi-view selection tasks.

Real-world Performance

Real-world Results

We validate VAP on a physical tabletop setup. VAP (yellow bars) consistently outperforms baselines (blue/green/red).

BibTeX

@misc{lee2025bringcuppersonalizingvisionlanguageaction,
      title={Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting}, 
      author={Sangoh Lee and Sangwoo Mo and Wook-Shin Han},
      year={2025},
      eprint={2512.20014},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.20014}, 
}