Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data

*Equal Contribution
HIROL Lab, School of Artificial Intelligence, Shanghai Jiao Tong University
arXiv Code (Coming Soon)

Abstract

Achieving generalizable manipulation in unconstrained environments requires the robot to proactively resolve information uncertainty, i.e., the capability of active perception. However, existing methods are often confined to a limited set of sensing behaviors, restricting their applicability to complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self- and environmental awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages. Extensive experiments on a wheel-based humanoid demonstrate the strong robustness and adaptability of our method across diverse long-horizon tasks spanning multiple active perception scenarios.

Demo

Problem Formulation

Active perception establishes a closed-loop interaction between perception and action, where deliberate actions are executed to resolve task-relevant information uncertainty, and the resulting perceptual outcomes direct the branching of subsequent actions.

Active perception is a non-Markovian decision process (NMDP) that rests on two core mechanisms: (1) Information Gain and (2) Decision Branching. We further categorize visual active perception paradigms into Information Discovery (Viewpoint/Manipulation Discovery) and Information Enrichment.
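For concreteness, one way to write this down is the minimal sketch below, in our own notation; the page states these definitions only in prose, so the symbols here are illustrative assumptions rather than the paper's exact formulation.

% Non-Markovian policy: actions condition on the full interaction history
% h_t and the language goal g, not on the latest observation alone.
\[
  h_t = (o_1, a_1, \dots, a_{t-1}, o_t), \qquad a_t \sim \pi(\cdot \mid h_t, g)
\]

% (1) Information Gain: a sensing action a is worth executing if the
% observation it yields is expected to reduce uncertainty about the
% task-relevant state s.
\[
  \mathrm{IG}(a \mid h_t) = H(s \mid h_t)
    - \mathbb{E}_{o \sim p(o \mid h_t, a)}\big[\, H(s \mid h_t, a, o) \,\big]
\]

% (2) Decision Branching: the perceptual outcome selects which sub-task
% branch \tau to pursue next, which is why the continuation depends on the
% history rather than on the current observation alone.
\[
  \tau_{t+1} = \arg\max_{\tau \in \mathcal{T}} \; p(\tau \mid h_t)
\]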
Problem Formulation Diagram

Method

Leveraging large-scale human egocentric data, we propose CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework for learning non-Markovian active perception strategies. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self- and environmental awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages: (1) Cognitive State Pretraining on Human Data; (2) Cognition-Action Joint Pretraining on Human Data; (3) Robot Fine-tuning on Robot Data.
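To make the two components concrete, below is a minimal PyTorch-style sketch of the roles described above. Every name, dimension, and design choice here (GRU memory tracks, linear fusion, MLP action head) is a hypothetical illustration of the described functionality, not the paper's actual implementation.

# Minimal sketch of the CoMe-VLA components described above, assuming a
# PyTorch-style implementation. All module names, dimensions, and the
# fusion scheme are hypothetical -- the paper's architecture may differ.
import torch
import torch.nn as nn


class DualTrackMemory(nn.Module):
    """Fuses proprioceptive and visual temporal contexts via two tracks."""

    def __init__(self, proprio_dim=32, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.proprio_track = nn.GRU(proprio_dim, hidden_dim, batch_first=True)
        self.visual_track = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, proprio_seq, visual_seq):
        # Each track summarizes its own modality over the history window.
        _, h_self = self.proprio_track(proprio_seq)  # self-awareness track
        _, h_env = self.visual_track(visual_seq)     # environment-awareness track
        return self.fuse(torch.cat([h_self[-1], h_env[-1]], dim=-1))


class CognitiveHead(nn.Module):
    """Auxiliary head predicting the current sub-task / transition signal."""

    def __init__(self, hidden_dim=256, num_subtasks=8):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_subtasks)

    def forward(self, memory_state):
        return self.classifier(memory_state)  # logits over sub-task labels


class ActionHead(nn.Module):
    """Maps the fused memory to a chunk of egocentric-space actions."""

    def __init__(self, hidden_dim=256, action_dim=7, horizon=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, memory_state):
        out = self.mlp(memory_state)
        return out.view(-1, self.horizon, self.action_dim)  # action chunk


if __name__ == "__main__":
    B, T = 2, 10
    memory = DualTrackMemory()
    cog_head, act_head = CognitiveHead(), ActionHead()
    state = memory(torch.randn(B, T, 32), torch.randn(B, T, 512))
    print(cog_head(state).shape, act_head(state).shape)  # (2, 8) (2, 16, 7)

Under the three-stage recipe, a natural reading is that stage (1) supervises only the cognitive head on human data, stage (2) trains the cognitive and action heads jointly, and stage (3) fine-tunes the whole model on robot data in the shared egocentric action space; this mapping is inferred from the stage names and is not confirmed beyond them.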

Experiments

We compare our method against two categories of baselines: (1) general-purpose VLAs and (2) imitation learning policies.

Ablation Studies

We ablate cognition-based task decomposition and memory mechanisms to validate their effectiveness in enhancing active perception performance.

BibTeX

@misc{li2026actsenseactlearning,
      title={Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data}, 
      author={Jialiang Li and Yi Qiao and Yunhan Guo and Changwen Chen and Wenzhao Lian},
      year={2026},
      eprint={2602.04600},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.04600}, 
}