How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Computer Science > Artificial Intelligence. Submitted on 24 Feb 2026.

Abstract:

• Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence.
• However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control.
• In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels.
• To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space.
• Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance.
• For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill.

Sources:

https://arxiv.org/abs/2602.20687 (Latest source article published: 2026-02-25 05:00 UTC)