Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026] Title:Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents View PDF HTML (experimental)Abstract:Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. • Benchmarks for these agents must both reliably compare models and yield on-policy training data. • Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. • We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. • Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. • LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints.

Article Summaries:

Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026] Title:Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents View PDF HTML (experimental)Abstract:Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate.

Sources:

https://arxiv.org/abs/2602.16246