Computer Science > Artificial Intelligence
[Submitted on 24 Feb 2026]

Title: Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination

Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech.

Article Summaries:

  • Researchers have introduced MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent to improve human-AI coordination. MIMIC uses vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates "inner speech" from observations; a diffusion-based behavior-cloning policy then selects actions conditioned on the current observation and the generated inner speech. This approach allows fine-grained steering of an agent's behavior at inference time without requiring additional demonstrations. Experiments on robotic manipulation and human-AI collaboration games show that MIMIC improves both the diversity and the fidelity of learned behaviors. The authors have released the code and pre-trained agents publicly.
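
The two-stage flow described above (observation to inner speech, then observation plus inner speech to action) can be sketched roughly as follows. This is only an illustrative data-flow sketch, not the authors' implementation: every name, dimension, and weight is a hypothetical placeholder, the "networks" are random untrained matrices, inner speech is represented as a fixed-size embedding vector rather than text, and the denoising loop is a simplified stand-in for a real diffusion sampler. Steering by swapping in a user-supplied inner-speech vector is likewise an assumption about how inference-time control might be wired in.

```python
import numpy as np

# All dimensions, weights, and function names are hypothetical placeholders
# chosen for illustration; none come from the MIMIC paper or its code release.
OBS_DIM, LATENT_DIM, SPEECH_DIM, ACTION_DIM = 8, 4, 6, 2

_init = np.random.default_rng(0)
# Random, untrained weights standing in for learned networks.
W_prior = _init.normal(scale=0.1, size=(OBS_DIM, 2 * LATENT_DIM))
W_dec = _init.normal(scale=0.1, size=(OBS_DIM + LATENT_DIM, SPEECH_DIM))
W_eps = _init.normal(
    scale=0.1, size=(ACTION_DIM + OBS_DIM + SPEECH_DIM + 1, ACTION_DIM)
)


def generate_inner_speech(obs, rng):
    """CVAE-style sampling: draw z from an observation-conditioned prior, then decode."""
    stats = obs @ W_prior
    mean, log_var = stats[:LATENT_DIM], stats[LATENT_DIM:]
    # Reparameterized sample from N(mean, exp(log_var)).
    z = mean + np.exp(0.5 * log_var) * rng.normal(size=LATENT_DIM)
    # Decode (obs, z) into an inner-speech embedding.
    return np.tanh(np.concatenate([obs, z]) @ W_dec)


def select_action(obs, speech, rng, steps=10):
    """Toy iterative denoising: start from noise, refine conditioned on (obs, speech)."""
    action = rng.normal(size=ACTION_DIM)
    for t in range(steps, 0, -1):
        inp = np.concatenate([action, obs, speech, [t / steps]])
        predicted_noise = np.tanh(inp @ W_eps)
        # Simplified update; a real diffusion sampler follows a noise schedule.
        action = action - predicted_noise / steps
    return action


def act(obs, rng, steering_speech=None):
    """Use the supplied inner speech if given (steering); otherwise generate one."""
    if steering_speech is None:
        speech = generate_inner_speech(obs, rng)
    else:
        speech = steering_speech
    return select_action(obs, speech, rng)


obs = np.ones(OBS_DIM)
a_default = act(obs, np.random.default_rng(1))
a_steered = act(obs, np.random.default_rng(1), steering_speech=np.full(SPEECH_DIM, 0.9))
```

In this sketch the action is conditioned on the inner-speech vector at every denoising step, so changing only that vector changes the resulting action while the observation and the policy weights stay fixed.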

Sources: