Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Computer Science > Machine Learning [Submitted on 22 Feb 2026]

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the inefficient exploration that results from uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Second, an asymmetric activation function (LeakyReLU) uses the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction.
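The abstract names the two mechanisms but gives no formulas, so the following is only a rough Python sketch of the general ideas, not CalibRL's actual method. The standard LeakyReLU definition is the only part taken as given. The inverse-frequency form of `rareness_weights`, the log-probability gap in `calibrated_gap`, and all function and parameter names are assumptions introduced purely for illustration.

```python
from collections import Counter


def leaky_relu(x, negative_slope=0.1):
    """Standard LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise."""
    return x if x >= 0 else negative_slope * x


def rareness_weights(rewards):
    """Hypothetical rareness weighting (an assumption, not the paper's formula).

    Each sampled response is weighted by the inverse frequency of its reward
    outcome within the group, normalized so that a group with a single
    outcome gets weight 1.0 everywhere.
    """
    counts = Counter(rewards)
    n, k = len(rewards), len(counts)
    return [n / (k * counts[r]) for r in rewards]


def calibrated_gap(expert_logp, policy_logp, negative_slope=0.1):
    """Hypothetical use of an expert log-probability as a baseline.

    A positive gap (the policy assigns less probability than the expert)
    passes through unchanged; a negative gap is shrunk by negative_slope
    but keeps its sign, so the direction of the correction is preserved.
    """
    return leaky_relu(expert_logp - policy_logp, negative_slope)


if __name__ == "__main__":
    # One group of four sampled responses with binary verifiable rewards.
    rewards = [1, 0, 0, 0]
    print("rareness weights:", rareness_weights(rewards))
    print("policy below expert:", calibrated_gap(-0.5, -2.0))
    print("policy above expert:", calibrated_gap(-2.0, -0.5))
```

In this sketch the single correct response in the group receives a larger weight than the three incorrect ones, and the asymmetric activation damps the update when the policy is already more confident than the expert without flipping its sign.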

Sources:

https://arxiv.org/abs/2602.20197 (Latest source article published: 2026-02-25 05:00 UTC)