TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

• The TPRU dataset targets gaps in temporal and procedural understanding in multimodal LLMs, with the aim of supporting embodied AI.
• It comprises robotic manipulation and GUI navigation scenes and defines three tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review.
• It includes hard negative samples that force cross-modal validation rather than passive observation.
• Reinforcement learning fine-tuning raises a 7B model's accuracy on TPRU-Test from 50.33% to 75.70%.
• The fine-tuned model outperforms larger baselines such as GPT-4o, which points to the efficiency of smaller, deployable models.
• The gains transfer to established benchmarks, indicating that the temporal reasoning skills generalize.

Article Summaries:

Researchers have released TPRU, a large-scale multimodal dataset designed to improve temporal and procedural reasoning in deployable multimodal large language models. TPRU draws on embodied scenarios such as robotic manipulation and GUI navigation and defines three tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. It also includes challenging negative samples that force models to perform cross-modal validation. The team fine-tuned models with reinforcement learning, raising the accuracy of a 7-billion-parameter model on the TPRU-Test set from 50.33% to 75.70%, a gain of about 25 percentage points. The fine-tuned model outperforms larger baselines, including GPT-4o, and the gains carry over to established benchmarks. The code and data are publicly available.
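To make the three task types concrete, here is a minimal Python sketch of how samples of each kind could be built from one ordered sequence of frames. This is an illustration only, not the authors' pipeline: the summary does not describe TPRU's data format, so the dictionary layout, the function names, the multiple-choice framing, and the way distractors are drawn are all assumptions. Hard negative samples are not modeled here.

```python
import random

def make_reordering_sample(frames, rng):
    """Temporal Reordering (assumed format): shuffle the frames and keep the
    permutation that restores the original order as the answer."""
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]
    # answer[k] is the position in `shuffled` of the k-th original frame
    answer = [order.index(k) for k in range(len(frames))]
    return {"task": "temporal_reordering", "frames": shuffled, "answer": answer}

def _make_choice_sample(task, context, target, frames, rng, n_candidates):
    """Hypothetical helper: multiple choice between the target frame and
    distractors drawn from the other frames of the same sequence."""
    pool = [f for f in frames if f != target]
    candidates = rng.sample(pool, n_candidates - 1) + [target]
    rng.shuffle(candidates)
    return {"task": task, "context": context, "candidates": candidates,
            "answer": candidates.index(target)}

def make_next_frame_sample(frames, t, rng, n_candidates=3):
    """Next-Frame Prediction (assumed format): show frames before step t,
    ask which candidate is the frame at step t."""
    return _make_choice_sample("next_frame_prediction", frames[:t],
                               frames[t], frames, rng, n_candidates)

def make_previous_frame_sample(frames, t, rng, n_candidates=3):
    """Previous-Frame Review (assumed format): show frames after step t,
    ask which candidate is the frame at step t."""
    return _make_choice_sample("previous_frame_review", frames[t + 1:],
                               frames[t], frames, rng, n_candidates)

def accuracy(samples, predictions):
    """Fraction of samples whose prediction exactly matches the stored answer."""
    correct = sum(pred == s["answer"] for s, pred in zip(samples, predictions))
    return correct / len(samples)

if __name__ == "__main__":
    rng = random.Random(0)
    frames = [f"step_{i}.png" for i in range(6)]  # placeholder frame names
    samples = [
        make_reordering_sample(frames, rng),
        make_next_frame_sample(frames, 3, rng),
        make_previous_frame_sample(frames, 3, rng),
    ]
    # An oracle that always returns the stored answer scores 1.0
    print(accuracy(samples, [s["answer"] for s in samples]))
```

Under this sketch, a model is scored by exact match against the stored answer, which is the kind of accuracy figure the summary reports for TPRU-Test (50.33% before fine-tuning, 75.70% after).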

Sources:

https://arxiv.org/abs/2602.18884