Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Computer Science > Artificial Intelligence
[Submitted on 29 Jan 2026]

Abstract: Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework.

Sources:

https://arxiv.org/abs/2602.13235