"Put my camera on towel"
Standard Vision-Language-Action (VLA) models are excellent at understanding generic concepts like "pick up the cup." However, they fail when users ask for specific instances: "Bring my cup." To a generic VLA, your favorite mug is just "a cup" among many others.
Visual Attentive Prompting (VAP) bridges this gap between semantic understanding and instance-level control. By visually highlighting the target object in the robot's view, VAP allows existing VLAs to manipulate unseen personal objects immediately: no training, no fine-tuning, just prompting.
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
VAP is a training-free framework that enables frozen VLA models to understand personalized references. Given a few reference images of a user's object (e.g., "my toy owl"), VAP first grounds the object in the current scene using an open-vocabulary detector and a retriever. It then generates a Visual Prompt by overlaying a semi-transparent red tint on the target object and rewrites the text instruction (e.g., "my toy owl" → "red toy owl"). This explicit visual cue guides the VLA's attention, allowing it to manipulate specific instances among visually similar distractors without any fine-tuning.
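For concreteness, the following minimal Python sketch illustrates one way such a grounding-and-prompting step could be implemented; it is not the authors' code. The open-vocabulary detector and the image encoder are assumed to be supplied by the caller, and the helper names (ground_personal_object, apply_visual_prompt, rewrite_instruction) are purely illustrative.

import numpy as np
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Detection:
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
    crop: np.ndarray                 # HxWx3 uint8 crop of the detected candidate

def ground_personal_object(
    scene: np.ndarray,                                       # HxWx3 uint8 RGB frame
    reference_embeds: np.ndarray,                            # (K, D) embeddings of the reference images
    detect: Callable[[np.ndarray, str], List[Detection]],    # open-vocabulary detector (assumed)
    embed: Callable[[np.ndarray], np.ndarray],               # image encoder (assumed): crop -> (D,) embedding
    category: str,                                           # generic class name, e.g. "toy owl"
) -> Detection:
    """Return the detected instance whose embedding best matches the visual memory."""
    candidates = detect(scene, category)
    if not candidates:
        raise ValueError(f"No '{category}' candidates detected in the scene.")
    refs = reference_embeds / np.linalg.norm(reference_embeds, axis=1, keepdims=True)
    scores = []
    for det in candidates:
        e = embed(det.crop)
        e = e / np.linalg.norm(e)
        scores.append(float(np.max(refs @ e)))   # best cosine similarity to any reference image
    return candidates[int(np.argmax(scores))]

def apply_visual_prompt(scene: np.ndarray, det: Detection, alpha: float = 0.4) -> np.ndarray:
    """Overlay a semi-transparent red tint on the grounded region (a segmentation mask could replace the box)."""
    out = scene.astype(np.float32)
    x1, y1, x2, y2 = det.box
    red = np.array([255.0, 0.0, 0.0])            # assumes RGB channel order
    out[y1:y2, x1:x2] = (1.0 - alpha) * out[y1:y2, x1:x2] + alpha * red
    return out.astype(np.uint8)

def rewrite_instruction(instruction: str, personal_phrase: str) -> str:
    """Rewrite the personal reference to point at the visual cue, e.g. "my toy owl" -> "red toy owl"."""
    return instruction.replace(personal_phrase, personal_phrase.replace("my", "red", 1))

In use, the tinted frame and rewritten instruction would simply replace the raw observation and command fed to the frozen VLA; nothing in the policy itself is modified, which is what keeps the approach training-free.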
We establish three benchmarks to rigorously evaluate instance-level manipulation capabilities. In all settings, the robot must identify a user-specific target (defined by ~5 reference images) among visually similar distractors of the same category.
We evaluate VAP on our two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench. The videos below demonstrate how VAP grounds the target instance and guides the frozen VLA policy to complete the task.
"Put my camera on towel"
"Put my owl figurine on plate"
"Pick my shaver"
"Put my straw cup in basket"
"Pick my pen holder" (Visual Matching)
"Pick my pen holder" (Visual Variant)
"Move my bottle near coke can" (Visual Matching)
"Move my bottle near coke can" (Visual Variant)
"Select my leather bag"
"Select my cat figurine"
"Select my cup"
"Select my miniature house"
"Select my shoe"
On the Bridge-based tasks with the WidowX arm, VAP achieves high success rates across all tasks, grounding the user's object correctly even when visually similar distractors are present.
VAP significantly outperforms generic policies and text-based prompts. Notably, in the challenging Visual Variant track, VAP maintains a strong success rate of 58.2%.
VAP achieves the highest Success Rate (SR) across all 5 object categories in multi-view selection tasks.
We validate VAP on a physical tabletop setup. VAP (yellow bars) consistently outperforms the baselines (blue, green, and red bars).
@misc{lee2025bringcuppersonalizingvisionlanguageaction,
      title={Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting},
      author={Sangoh Lee and Sangwoo Mo and Wook-Shin Han},
      year={2025},
      eprint={2512.20014},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.20014},
}