• This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training.
• It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to 1.48x speedup on real-world datasets.
• In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets.
• Both LLM training and large-scale video generation have clear long-tail distributions in sequence length.
• A small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption. In LLM training, this leads to wide-ranging text sequence lengths across batches.
• In video generation, high-resolution, multi-second videos can span tens of thousands of tokens.

Article Summaries:

  • NVIDIA’s Megatron Core now includes Dynamic Context Parallelism (Dynamic‑CP), a scheduling technique that adjusts the context‑parallel (CP) size for each microbatch during large‑scale language‑model or diffusion‑model training. By selecting a CP size that matches the longest sequence in a packed batch, rather than always using a fixed maximum CP size, Dynamic‑CP reduces both compute imbalance and unnecessary communication overhead. The approach targets the long‑tail distribution of sequence lengths, which causes GPU idling and pipeline bubbles, and it can hide CP communication when there is enough compute to overlap with it. Benchmarks on real‑world datasets show up to a 1.48× speedup, indicating more efficient resource utilization for variable‑length training workloads.
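
The per-microbatch selection described above can be illustrated with a minimal sketch. This is not Megatron Core's API or its actual scheduler: the function name `pick_cp_size`, the per-GPU token budget `max_tokens_per_gpu`, and the power-of-two search are all assumptions made for illustration. The idea shown is only that a microbatch of short sequences can run with a small CP size, while a microbatch containing an ultra-long sample is given a larger one.

```python
from math import ceil


def pick_cp_size(seq_lens, max_tokens_per_gpu, max_cp_size):
    """Hypothetical helper: choose a CP size for one packed microbatch.

    Assumption (not from the source): the longest sequence is split evenly
    across CP ranks, and each rank can hold at most `max_tokens_per_gpu`
    tokens of it. We return the smallest power-of-two CP size that satisfies
    this, capped at `max_cp_size`.
    """
    if not seq_lens:
        raise ValueError("microbatch must contain at least one sequence")

    longest = max(seq_lens)
    cp_size = 1
    # Double the CP size until the longest sequence's shard fits on one rank,
    # or until the cap is reached (in which case the cap is returned as-is).
    while cp_size < max_cp_size and ceil(longest / cp_size) > max_tokens_per_gpu:
        cp_size *= 2
    return cp_size


# Illustrative microbatches with made-up sequence lengths.
microbatches = [
    [512, 1024, 2048],   # only short sequences
    [768, 10000],        # one moderately long sequence
    [256, 30000],        # one ultra-long sequence
]
cp_sizes = [pick_cp_size(mb, max_tokens_per_gpu=4096, max_cp_size=8)
            for mb in microbatches]
```

With a fixed CP size, all three microbatches would pay the communication cost sized for the longest sample. In this sketch only the microbatches that contain long sequences are assigned a larger CP size.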

Sources: