Vision-Language Navigation in Human-Populated Environments

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

Haoxuan Xu1,*, Tianfu Li1,*, Wenbo Chen1, Yi Liu2, Jin Wu3, Huashuo Lei1, Yunfan Lou4, Lujia Wang1, Hesheng Wang5, Haoang Li1

1 The Hong Kong University of Science and Technology (Guangzhou) 2 Tsinghua University 3 University of Science and Technology Beijing 4 National University of Singapore 5 Shanghai Jiao Tong University

* Equal contribution.

+14.3% Success Rate over the strongest baseline on validation unseen
-34.5% Collision Rate relative to the strongest baseline on validation unseen
82% Overall success rate in physical robot deployment
Illustrative HCSG navigation scenario with semantic and geometric reasoning.
HCSG combines semantic interpretation and geometric forecasting to navigate toward task-relevant people while maintaining safe social distance.

Abstract

Vision-Language Navigation (VLN) has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans as moving obstacles based on implicit visual cues, lacking explicit reasoning about human intentions and social norms.

HCSG is a human-centric framework for VLN that shifts the paradigm from passive collision avoidance to active human behavior understanding. It introduces a unified Human Understanding Module with two complementary capabilities: geometric forecasting, which predicts human pose and trajectory to anticipate future motion, and semantic interpretation, which uses a Vision-Language Model to generate natural-language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning, while a Social Distance Loss encourages socially compliant interaction.
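The exact form of the Social Distance Loss is not given on this page. As a rough illustration only, the sketch below shows one hinge-style penalty that discourages choosing waypoints closer to predicted human positions than a comfort radius; the function name, tensor shapes, and the 1.2 m radius are assumptions for illustration, not the paper's formulation.

```python
import torch

def social_distance_penalty(waypoint_xy: torch.Tensor,
                            human_xy: torch.Tensor,
                            d_soc: float = 1.2) -> torch.Tensor:
    """Illustrative hinge penalty on proximity to people (not HCSG's exact loss).

    waypoint_xy: (B, 2) planar positions of the chosen waypoints.
    human_xy:    (B, J, 2) predicted planar positions of J humans.
    d_soc:       assumed comfort radius in metres.
    """
    # Distance from each chosen waypoint to every predicted human position.
    dist = torch.linalg.norm(waypoint_xy[:, None, :] - human_xy, dim=-1)  # (B, J)
    # Penalize only violations, i.e. distances inside the comfort radius.
    return torch.clamp(d_soc - dist, min=0.0).mean()
```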

Method Overview

HCSG injects explicit human-centric reasoning into a waypoint-based VLN policy through parallel geometric and semantic streams.

Overview of the HCSG framework.

At timestep \(t\), HCSG follows waypoint-based VLN: the agent receives panoramic observations \(\mathbb{O}_t = (\mathcal{V}_t^{\mathrm{rgb}}, \mathcal{V}_t^{\mathrm{depth}})\), and an external waypoint predictor \(f_{\mathrm{way}}\) generates candidate nodes \(\mathcal{W}_t\) for the next action.
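For concreteness, a minimal sketch of this per-timestep interface is shown below; PanoObs and predict_waypoints are illustrative stand-ins for the panoramic observation \(\mathbb{O}_t\) and the external waypoint predictor \(f_{\mathrm{way}}\), not the released implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PanoObs:
    """Panoramic observation O_t: aligned RGB and depth views."""
    rgb: np.ndarray    # (num_views, H, W, 3) panoramic RGB views V_t^rgb
    depth: np.ndarray  # (num_views, H, W, 1) panoramic depth views V_t^depth

def predict_waypoints(obs: PanoObs, num_candidates: int = 5) -> np.ndarray:
    """Stand-in for the external waypoint predictor f_way.

    A real predictor scores navigable directions from RGB-D; this placeholder
    simply returns evenly spaced (heading, distance) candidates W_t.
    """
    headings = np.linspace(0.0, 2.0 * np.pi, num_candidates, endpoint=False)
    distances = np.full(num_candidates, 1.5)       # placeholder distance in metres
    return np.stack([headings, distances], axis=1)  # (num_candidates, 2)
```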

The visual encoder extracts static node features; if humans are detected, the agent briefly pauses at the current waypoint to collect a short temporal observation sequence:

\[ \mathcal{F}_{t,k}^{\mathrm{static}} = \mathcal{E}_{v}(\mathbb{O}_{t,k}), \quad \mathcal{S} = \langle \mathbb{O}_{\tau} \rangle_{\tau=t}^{t+m-1}. \]
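A minimal sketch of this step, assuming a generic image backbone in place of \(\mathcal{E}_v\); StaticEncoder and collect_sequence are placeholder names, and the backbone below is not the encoder HCSG actually uses.

```python
import torch
import torch.nn as nn

class StaticEncoder(nn.Module):
    """Stand-in for E_v: maps one candidate-view image to its static feature."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Tiny placeholder backbone; a real system would use a ViT/CNN encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        return self.backbone(rgb)                           # (B, feat_dim)

def collect_sequence(get_observation, t: int, m: int):
    """Pause at the waypoint and gather S = <O_t, ..., O_{t+m-1}>.

    get_observation is a hypothetical callable returning the observation at a
    given timestep; it stands in for the agent holding position for m steps.
    """
    return [get_observation(tau) for tau in range(t, t + m)]
```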

For each detected human \(j\), HCSG first constructs the geometric stream from two future-oriented cues. The pose branch estimates body keypoints, while the trajectory branch derives relative motion cues from depth-based projection and predicts future occupancy:

\[ \mathcal{P}_{j}^{\mathrm{pose}} = \left\{\mathbf{K}_{j}^{m}\right\}_{m=1}^{M}, \quad \mathbf{K}_{j}^{m} = \left\{\mathbf{k}_{i,m}^{\mathrm{pose}}\right\}_{i=1}^{17}, \] \[ \mathcal{F}_{t,k,j}^{\mathrm{geo}} = \mathcal{E}_{g}\left( \mathcal{P}_{j}^{\mathrm{pose}}, \mathcal{P}_{j}^{\mathrm{traj}} \right). \]
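The sketch below illustrates one way \(\mathcal{E}_g\) could embed the 17-keypoint pose sequence together with the depth-derived trajectory cues into a single geometric feature; the tensor shapes and the MLP design are assumptions for illustration, not HCSG's actual architecture.

```python
import torch
import torch.nn as nn

class GeometricEncoder(nn.Module):
    """Stand-in for E_g: fuses a pose sequence and trajectory cues per human."""
    def __init__(self, num_frames: int = 4, num_joints: int = 17,
                 traj_dim: int = 8, feat_dim: int = 768):
        super().__init__()
        pose_dim = num_frames * num_joints * 2  # assumed 2-D keypoints per frame
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + traj_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, pose_seq: torch.Tensor, traj: torch.Tensor) -> torch.Tensor:
        # pose_seq: (B, M, 17, 2) keypoints P^pose; traj: (B, traj_dim) cues P^traj
        x = torch.cat([pose_seq.flatten(1), traj], dim=-1)
        return self.mlp(x)  # (B, feat_dim) geometric feature F^geo per human
```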

The semantic stream then uses the cropped human observation and a task-specific prompt to obtain a navigation-relevant VLM description, which is encoded as the semantic feature:

\[ \mathcal{F}_{t,k,j}^{\mathrm{sem}} = \mathcal{E}_{s}\left(\mathrm{VLM}(I_{t,k,j}, \mathcal{Q})\right). \]
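A sketch of this step under loose assumptions: query_vlm stands in for the VLM call on the cropped human image \(I_{t,k,j}\) with prompt \(\mathcal{Q}\), and TextEncoder stands in for \(\mathcal{E}_s\). Neither the prompt wording nor the models shown here are taken from the paper.

```python
import torch
import torch.nn as nn

PROMPT_Q = ("In one short sentence relevant to robot navigation, describe what "
            "this person is doing and whether they are likely to move.")

def query_vlm(human_crop, prompt: str = PROMPT_Q) -> str:
    """Placeholder for the VLM call; a real system would pass the image crop
    and prompt to an off-the-shelf vision-language model."""
    return "A person is standing still and chatting, unlikely to move soon."

class TextEncoder(nn.Module):
    """Stand-in for E_s: embeds the description into the semantic feature F^sem."""
    def __init__(self, vocab_size: int = 30522, feat_dim: int = 768):
        super().__init__()
        # Mean-pooled token embeddings as a minimal text encoder.
        self.embedding = nn.EmbeddingBag(vocab_size, feat_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # (B, L) token ids
        return self.embedding(token_ids)                          # (B, feat_dim)
```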

The geometric and semantic cues are fused with the static scene representation, and the final waypoint action is selected via graph-aware self-attention (GASA) over the resulting human-aware topological representation, conditioned on the instruction \(\mathcal{I}\):

\[ \mathcal{F}_{t,k}^{\mathrm{fused}} = \mathrm{MLP}\left( \mathcal{F}_{t,k}^{\mathrm{static}}, \frac{1}{J}\sum_{j=1}^{J} \left(\mathcal{F}_{t,k,j}^{\mathrm{geo}}+\mathcal{F}_{t,k,j}^{\mathrm{sem}}\right) \right), \] \[ a_{t+1} = \mathrm{FFN}\left( \mathrm{GASA}\left(\mathcal{F}_{t}^{\mathrm{fused}}, \mathcal{W}_{t}, \mathcal{I}\right) \right). \]
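The sketch below ties the pieces together under the same assumptions: per-human geometric and semantic features are averaged, fused with the static feature by an MLP, and scored against the instruction to pick the next waypoint. GASA is approximated here by standard cross-attention over instruction tokens; the graph-aware variant used in the paper is not reproduced.

```python
import torch
import torch.nn as nn

class HumanAwareFusion(nn.Module):
    """Illustrative fusion + action head, not HCSG's released implementation."""
    def __init__(self, feat_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Cross-attention over instruction tokens as a stand-in for GASA.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.ffn = nn.Linear(feat_dim, 1)  # per-waypoint score

    def forward(self, f_static, f_geo, f_sem, instr_tokens):
        # f_static: (B, K, D) static features of K candidate waypoints
        # f_geo, f_sem: (B, K, J, D) per-human features for each candidate view
        # instr_tokens: (B, L, D) encoded instruction I
        f_human = (f_geo + f_sem).mean(dim=2)                      # average over J humans
        fused = self.fuse(torch.cat([f_static, f_human], dim=-1))  # F^fused
        attended, _ = self.cross_attn(fused, instr_tokens, instr_tokens)
        logits = self.ffn(attended).squeeze(-1)                    # (B, K)
        return logits.argmax(dim=-1)                               # index of a_{t+1}
```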

Experimental Results

On HA-VLNCE, HCSG improves navigation success while substantially reducing human-related collisions.

Comparison with State-of-the-Art Methods on HA-VLNCE

Models          | Validation Seen        | Validation Unseen
                | TCR ↓  CR ↓  SR ↑      | TCR ↓  CR ↓  SR ↑
HA-VLN-CMA      | 63.09  0.77  0.05      | 47.06  0.77  0.07
HA-VLN-CMA-DA   | 17.45  0.61  0.17      | 27.25  0.69  0.09
HA-VLN-VL       |  4.44  0.52  0.20      |  6.63  0.59  0.14
LAW-VLNCE       |  4.31  0.54  0.21      |  5.88  0.65  0.15
DUET            |  4.18  0.48  0.22      |  5.74  0.63  0.16
ETPNav          |  4.07  0.43  0.24      |  6.94  0.58  0.17
GridMM          |  3.92  0.45  0.24      |  5.76  0.59  0.18
BEVBert         |  3.64  0.46  0.27      |  4.71  0.55  0.21
HCSG (Ours)     |  3.63  0.34  0.29      |  5.02  0.36  0.24
Qualitative HA-VLNCE case study where HCSG recognizes relevant human behavior.
HCSG recognizes instruction-relevant human behavior and completes the navigation task.
Qualitative HA-VLNCE case study where HCSG avoids a moving person.
HCSG anticipates dynamic human motion and avoids collision while continuing toward the goal.

Real-World Deployment

Five physical-robot scenarios, shown in Fig. 7(a)-(e), cover stationary interaction targets, pacing or crossing pedestrians, hallway blockage, and blind-corner encounters.

Real-world deployment scenarios for HCSG.
Qualitative real-world deployment on the NXROBO Leo platform across five representative human-populated scenarios.

Fig. 7(a): Interacting Pedestrians

Stop in front of two interacting pedestrians and wait at a socially appropriate distance.

Fig. 7(b): Pacing Pedestrian

Avoid a pacing pedestrian near a pillar through proactive detour planning.

Fig. 7(c): Open Lounge Crossing

Navigate to the target while safely avoiding a moving pedestrian crossing the path.

Fig. 7(d): Hallway Blockage

Identify a stationary human in the hallway and bypass without treating them as the goal.

Fig. 7(e): Blind Corner

Traverse a corridor while yielding to an unseen pedestrian appearing from a blind corner.

Quantitative Results of Real-World Deployment

Real-World Scenario                                       | BEVBert     | HCSG (Ours)
Scenario (a): Stop in front of interacting pedestrians    | 3/10 (30%)  | 9/10 (90%)
Scenario (b): Avoid pacing pedestrian on phone            | 3/10 (30%)  | 7/10 (70%)
Scenario (c): Navigate to plant, avoid moving human       | 5/10 (50%)  | 9/10 (90%)
Scenario (d): Bypass standing human in hallway            | 6/10 (60%)  | 9/10 (90%)
Scenario (e): Traverse corridor, avoid corner pedestrian  | 2/10 (20%)  | 7/10 (70%)
Overall Average                                           | 19/50 (38%) | 41/50 (82%)

Citation

Citation information will be updated after publication.

@article{xu2026hcsg,
  title={HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation},
  author={Xu, Haoxuan and Li, Tianfu and Chen, Wenbo and Liu, Yi and Wu, Jin and Lei, Huashuo and Lou, Yunfan and Wang, Lujia and Wang, Hesheng and Li, Haoang},
  year={2026}
}