Bring My Cup! ☕
Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Pohang University of Science and Technology (POSTECH)

Imagine asking for your favorite coffee mug while getting ready for work, or having a robot fetch your pet’s specific toy from a pile of similar objects. To be truly useful in daily life, robots must discern the subtle details that distinguish "a cup" from "my cup."

"Put my camera on towel"

"Pick my pen holder"

"Select my leather bag"

"Pick my shaver"

VAP enables frozen VLA models to manipulate user-specific objects among visually similar distractors.

From "A Cup" to "MY Cup"

Standard Vision-Language-Action (VLA) models are excellent at understanding generic concepts like "pick up the cup." However, they fail when users ask for specific instances: "Bring my cup." To a generic VLA, your favorite mug is just "a cup" among many others.

Visual Attentive Prompting (VAP) bridges this gap between semantic understanding and instance-level control. By visually highlighting the target object in the robot's view, VAP lets existing VLAs manipulate unseen personal objects immediately: no training, no fine-tuning, just prompting.

Abstract

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

Method: Visual Attentive Prompting

VAP Pipeline

VAP is a training-free framework that enables frozen VLA models to understand personalized references. Given a few reference images of a user's object (e.g., "my toy owl"), VAP first grounds the object in the current scene using an open-vocabulary detector and a retriever. It then generates a Visual Prompt by overlaying a semi-transparent red tint on the target object and rewrites the text instruction (e.g., "my toy owl" → "red toy owl"). This explicit visual cue guides the VLA's attention, allowing it to manipulate specific instances among visually similar distractors without any fine-tuning.
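To make the pipeline concrete, below is a minimal Python sketch of a VAP-style perceptual adapter. It is an illustration under assumptions, not the authors' implementation: `detect_boxes` is a hypothetical stand-in for any open-vocabulary detector, CLIP image embeddings play the role of the retriever over the reference images, and the red tint is applied to the detected bounding box rather than a segmentation mask.

```python
# Minimal sketch of a VAP-style pipeline (illustrative, not the authors' code).
# Assumptions: `detect_boxes` is a placeholder for an open-vocabulary detector
# returning (x0, y0, x1, y1) boxes for a category prompt; CLIP embeddings serve
# as the non-parametric visual memory over the user's reference images.
from typing import Callable, List, Tuple
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images: List[Image.Image]) -> torch.Tensor:
    """L2-normalized CLIP image embeddings."""
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def ground_personal_object(
    scene: Image.Image,
    references: List[Image.Image],   # ~5 images of the user's object
    category: str,                   # e.g., "toy owl"
    detect_boxes: Callable[[Image.Image, str], List[Tuple[int, int, int, int]]],
) -> Tuple[int, int, int, int]:
    """Return the detected box whose crop best matches the reference memory."""
    boxes = detect_boxes(scene, category)
    crops = [scene.crop(b) for b in boxes]
    ref_feats = embed(references)    # non-parametric visual memory
    crop_feats = embed(crops)
    # Mean cosine similarity of each candidate crop to the reference set.
    scores = (crop_feats @ ref_feats.T).mean(dim=1)
    return boxes[scores.argmax().item()]

def visual_prompt(scene: Image.Image, box, alpha: int = 96) -> Image.Image:
    """Overlay a semi-transparent red tint on the grounded object."""
    overlay = Image.new("RGBA", scene.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=(255, 0, 0, alpha))
    return Image.alpha_composite(scene.convert("RGBA"), overlay).convert("RGB")

def rewrite_instruction(instruction: str, category: str) -> str:
    """Illustrative rewrite rule: 'put my toy owl ...' -> 'put red toy owl ...'."""
    return instruction.replace(f"my {category}", f"red {category}")
```

The prompted image and rewritten instruction are then passed to the frozen VLA policy in place of the raw observation and command, so the policy itself never needs to be aware of the personalization step.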

Benchmarks Overview

We establish three benchmarks to rigorously evaluate instance-level manipulation capabilities. In all settings, the robot must identify a user-specific target (defined by ~5 reference images) among visually similar distractors of the same category.

  • Personalized-SIMPLER: Adapted from the SIMPLER simulation benchmark. We evaluate on both Google Robot (Fractal) and WidowX (Bridge) settings, replacing task-relevant objects with high-fidelity 3D assets from Sketchfab.
  • Personalized-VLABench: Extends VLABench with multi-view selection tasks on a Franka Emika Panda, requiring the policy to handle occlusion and remain consistent across 3 camera views.
  • Real-world Benchmark: A physical tabletop setup with a SO-101 arm. Includes 8 everyday categories (e.g., vase, slipper, plushie) with unseen instances collected from the real world.

Qualitative Results

We evaluate VAP in three challenging simulation settings. The videos below demonstrate how VAP grounds the target instance and guides the frozen VLA policy to complete each task.

1. Personalized-SIMPLER (WidowX)

"Put my camera on towel"

"Put my owl figurine on plate"

"Pick my shaver"

"Put my straw cup in basket"

2. Personalized-SIMPLER (Google Robot)

"Pick my pen holder" (Visual Matching)

"Pick my pen holder" (Visual Variant)

"Move my bottle near coke can" (Visual Matching)

"Move my bottle near coke can" (Visual Variant)

3. Personalized-VLABench

"Select my leather bag"

"Select my cat figurine"

"Select my cup"

"Select my miniature house"

"Select my shoe"

Quantitative Analysis

WidowX (Bridge)

Table 3 Results

On the WidowX arm using the Bridge dataset, VAP achieves high success rates across all tasks, effectively grounding the user's object even when visually similar distractors are present.

Google Robot (Fractal)

Table 2 Results

VAP significantly outperforms generic policies and text-based prompts. Notably, in the challenging Visual Variant track, VAP maintains a strong success rate of 58.2%.

Personalized-VLABench

Table 4 Results

VAP achieves the highest Success Rate (SR) across all 5 object categories in multi-view selection tasks.

Real-world Performance

Real-world Results

We validate VAP on a physical tabletop setup. VAP (yellow bars) consistently outperforms baselines (blue/green/red).

BibTeX

@misc{lee2025bringcuppersonalizingvisionlanguageaction,
      title={Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting}, 
      author={Sangoh Lee and Sangwoo Mo and Wook-Shin Han},
      year={2025},
      eprint={2512.20014},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.20014}, 
}