Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Computer Science > Machine Learning [Submitted on 5 Feb 2026]

Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Some existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO.
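
The abstract does not spell out how the first-stage "masked shuffled reconstruction" examples are built. As a rough illustration only, the sketch below assumes word-level masking inside each reasoning step followed by a shuffle of step order, with the original ordered rationale as the target. The function name `make_masked_shuffled_example`, the `<mask>` token, and the `mask_ratio` parameter are all hypothetical and do not come from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def make_masked_shuffled_example(steps, mask_ratio=0.3, seed=0):
    """Build one illustrative reconstruction example from a teacher rationale.

    Assumption: each step has some words replaced by MASK, and the steps are
    presented in shuffled order; the target is the original ordered rationale.
    """
    rng = random.Random(seed)

    # Shuffle the step order; order[pos] is the original index of the step
    # shown at position pos in the input.
    order = list(range(len(steps)))
    rng.shuffle(order)

    masked_steps = []
    for idx in order:
        tokens = steps[idx].split()
        # Mask at least one word per non-empty step, never more than it has.
        n_mask = min(len(tokens), max(1, int(len(tokens) * mask_ratio)))
        positions = set(rng.sample(range(len(tokens)), n_mask))
        masked_steps.append(
            " ".join(MASK if i in positions else tok for i, tok in enumerate(tokens))
        )

    return {"input": masked_steps, "target": list(steps), "order": order}


if __name__ == "__main__":
    rationale = [
        "Tom has 3 apples.",
        "He buys 2 more apples.",
        "3 + 2 = 5 apples.",
    ]
    example = make_masked_shuffled_example(rationale)
    for line in example["input"]:
        print(line)
```

Under these assumptions, a student trained on such pairs has to recover both the missing words and the logical order of the steps, which is one plausible reading of "structural understanding".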

Article Summaries:

A new three‑stage curriculum learning framework improves the distillation of chain‑of‑thought (CoT) reasoning from large language models into smaller student models. The method first trains the student to reconstruct masked, shuffled text, building structural understanding. Next, Group Relative Policy Optimization (GRPO) is applied to masked completion tasks, allowing the model to balance accuracy with brevity. Finally, persistent failure cases are identified and the student is guided to rewrite reasoning, again optimized with GRPO. On the GSM8K benchmark, the approach boosts Qwen2.5‑3B‑Base accuracy by 11.29 % while shortening outputs by 27.4 %, outperforming prior distillation and instruction‑tuned variants.
Researchers propose a three‑stage curriculum learning framework to distill chain‑of‑thought (CoT) reasoning from large language models into smaller student models. The method first trains the student to reconstruct masked, shuffled reasoning steps, building structural understanding. Next, Group Relative Policy Optimization (GRPO) is applied to masked completion tasks, letting the model balance accuracy and brevity. Finally, persistent failure cases are targeted for rewriting, again optimized with GRPO. On the GSM8K benchmark, the approach boosts Qwen2.5‑3B‑Base’s accuracy by 11.29 % while shortening outputs by 27.4 %, outperforming instruction‑tuned variants and earlier distillation techniques.
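
The summaries say GRPO lets the student balance accuracy with brevity but do not give the reward. The sketch below shows the group-relative advantage that GRPO is generally described as using (each sampled completion's reward is normalized against the mean and standard deviation of its own group), paired with a purely hypothetical reward that adds a brevity bonus to correct answers. The `reward` function, `max_tokens`, `length_weight`, and the use of the population standard deviation are assumptions made for illustration, not details from the paper.

```python
from statistics import mean, pstdev


def reward(is_correct, n_tokens, max_tokens=256, length_weight=0.2):
    """Hypothetical reward: 1 for a correct answer plus a small brevity bonus.

    Incorrect answers get 0, so brevity never compensates for a wrong result.
    """
    if not is_correct:
        return 0.0
    brevity = 1.0 - min(n_tokens, max_tokens) / max_tokens
    return 1.0 + length_weight * brevity


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four sampled completions: (is_correct, length in tokens).
    group = [(True, 64), (True, 192), (False, 64), (False, 200)]
    rewards = [reward(ok, n) for ok, n in group]
    for (ok, n), adv in zip(group, group_relative_advantages(rewards)):
        print(f"correct={ok} tokens={n} advantage={adv:+.3f}")
```

With a reward of this shape, a short correct completion receives a larger advantage than a long correct one from the same group, while incorrect completions are pushed down regardless of length. That is one way the trade-off between accuracy and brevity could emerge from the optimization rather than from a fixed length target.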

Sources:

https://arxiv.org/abs/2602.17686