Pixel-level Scene Understanding in One Token:
Visual States Need What-is-Where Composition

Agency for Defense Development (ADD)
CVPR 2026 Workshop · Pixel-level Video Understanding in the Wild
CroBo overview teaser
TL;DR — We propose CroBo, a self-supervised visual representation framework for robotics that encodes both what is in the scene and where it is in a single compact token. By reconstructing heavily masked local crops from a global bottleneck token, CroBo learns pixel-level scene composition and achieves state-of-the-art results on robot policy learning benchmarks.

Abstract

For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.

Method

CroBo architecture

Figure 1. CroBo architecture. A global source view is encoded by a vision backbone into a compact bottleneck token. The decoder then reconstructs the pixel content of heavily masked patches sampled from a local crop of the same frame, using sparse visible hints and the global bottleneck context. The high masking ratio (90%) compels the bottleneck to store precise what-is-where information across the entire observation.
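To make the masking step concrete, the sketch below samples which patches of a local crop stay visible at the 90% mask ratio described above. The helper name and patch counts are illustrative assumptions, not the authors' code.

```python
import random

def sample_patch_mask(num_patches, mask_ratio=0.9, rng=None):
    """Split a local crop's patch indices into visible and masked sets.

    At a 90% mask ratio only a sparse set of patches remains visible,
    so the decoder must recover the rest from the global bottleneck
    token (illustrative helper, not the authors' implementation).
    """
    rng = rng or random.Random(0)
    num_masked = int(round(num_patches * mask_ratio))
    idx = list(range(num_patches))
    rng.shuffle(idx)
    visible = sorted(idx[num_masked:])
    masked = sorted(idx[:num_masked])
    return visible, masked

# For a 14x14 patch grid (196 patches), only 20 patches stay visible.
visible, masked = sample_patch_mask(196)
```

The high mask ratio is the point: with so few visible hints in the crop, the reconstruction loss can only be lowered if the bottleneck token itself carries precise what-is-where information.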

Qualitative Results

Reconstruction Visualization

Reconstruction visualization

Figure 2. Reconstruction quality. CroBo accurately restores masked patches across diverse scenes — CLEVR (synthetic objects), DAVIS (natural video), MOSEv2 (complex and crowded video), and Franka Kitchen (robot manipulation) — faithfully preserving both object identity and spatial location from the bottleneck token alone.

Perceptual Straightness Analysis

Perceptual straightness

Figure 3. Perceptual straightness. CroBo produces smoother representation trajectories over time (75.4° mean curvature) than DINOv2 (103.28°). Straighter trajectories indicate that the latent space better reflects smooth visual state transitions, benefiting sequential decision making.
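Mean curvature here is the average turning angle between consecutive steps of a representation trajectory. A minimal pure-Python sketch of that metric (our own illustrative helper, assuming one representation vector per frame, not the authors' evaluation code):

```python
import math

def mean_curvature_deg(traj):
    """Mean curvature (in degrees) of a representation trajectory.

    traj: list of equal-length vectors, one per frame. The curvature
    at step t is the angle between consecutive difference vectors
    (z_t - z_{t-1}) and (z_{t+1} - z_t); a straight trajectory
    therefore scores 0 degrees.
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]

    def angle(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
        return math.degrees(math.acos(cos))

    deltas = [diff(traj[t + 1], traj[t]) for t in range(len(traj) - 1)]
    angles = [angle(deltas[t], deltas[t + 1]) for t in range(len(deltas) - 1)]
    return sum(angles) / len(angles)

# A straight-line trajectory curves by ~0 degrees; a right-angle
# turn curves by 90 degrees.
straight = [[t, 2.0 * t] for t in range(5)]
turn = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
```

Lower mean curvature over a video's frame-by-frame embeddings is what Figure 3 reports: transitions between visual states trace a straighter path in latent space.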

Quantitative Results

Quantitative results

Figure 4. Benchmark comparison. CroBo achieves state-of-the-art performance on Franka Kitchen (best on 4/5 tasks, including +13.6% on Micro Open) and competitive results across DeepMind Control Suite manipulation and locomotion tasks.

Contact

For inquiries, please reach out via email:

Seokmin Lee (First Author) — lsm9434@gmail.com

Byeongju Woo (Corresponding Author) — byeongju@umich.edu

You are welcome to reach us at our institutional (@add.re.kr) addresses as well, but for a faster response, please use the email addresses above.