Author: Paul Calley

• The landscape of machine learning and artificial intelligence is rapidly expanding, driving an immense demand for robust and scalable training platforms.
• As ML/AI applications become more sophisticated and widespread, organizations across all sectors are challenged to efficiently manage and optimize their compute resources.
• At Reddit, our ML Training Platform team is at the forefront of this evolution, continuously modernizing our infrastructure to meet these escalating demands.
• This post will delve into the architecture of Reddit’s ML Training Platform, detailing how it supports ML teams across the company, including those responsible for ad ranking, content categorization, and the core ranking systems that power the Reddit home feed.
• We’ll specifically highlight our integration with Kueue, a quota management and job queuing system, and how it enables us to scale the platform and ensure efficient resource scheduling for ever-increasing ML/AI training needs.

Our Kubernetes Scheduler Evolution for Batch Training

The diagram below shows the existing job scheduling flow on the platform, internally named Gazette.

Article Summaries:

  • Reddit’s Machine‑Learning training platform is being upgraded to improve scalability and efficiency. The team has added Kueue, a quota‑management and job‑queuing system, to its existing “Gazette” workflow that uses Airflow, custom CRDs (NodeClass and GazetteRayJob), and the Ray framework. Kueue enforces fair sharing and quotas across the organization, while the Achilles SDK and KubeRay controller translate user requests into RayJobs that run on GKE clusters with autoscaling node pools. This integration addresses previous limitations in on‑demand capacity, deadlocks from partially scheduled workloads, and first‑come, first‑served scheduling, enabling faster, more reliable ML training at scale.
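To make the quota and partial-scheduling points concrete, here is a minimal toy model of all-or-nothing admission under per-team quotas. It is a sketch for illustration only, not Kueue's implementation or Reddit's configuration: the `Job` and `Queue` classes, the team names, and the GPU numbers are all hypothetical, and real ordering and fair-sharing policies are more involved. The idea it shows is that a job is admitted only when its whole gang of workers fits in the team's remaining quota, so no job ever holds part of its resources while waiting for the rest.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """A hypothetical training job made of identical workers (a 'gang')."""
    name: str
    team: str
    workers: int
    gpus_per_worker: int

    @property
    def gpus(self) -> int:
        # Total GPUs the job needs before any worker can make progress.
        return self.workers * self.gpus_per_worker


@dataclass
class Queue:
    """Toy admission queue with a fixed GPU quota per team (assumed numbers)."""
    quota: Dict[str, int]
    used: Dict[str, int] = field(default_factory=dict)
    pending: List[Job] = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.pending.append(job)

    def admit(self) -> List[str]:
        """Admit each pending job only if its entire gang fits in the quota."""
        admitted, still_pending = [], []
        for job in self.pending:
            free = self.quota.get(job.team, 0) - self.used.get(job.team, 0)
            if job.gpus <= free:
                # Reserve the full request at once: all or nothing.
                self.used[job.team] = self.used.get(job.team, 0) + job.gpus
                admitted.append(job.name)
            else:
                # Job stays queued and holds no resources while it waits.
                still_pending.append(job)
        self.pending = still_pending
        return admitted

    def finish(self, job: Job) -> None:
        # Release the job's full reservation back to its team's quota.
        self.used[job.team] -= job.gpus


q = Queue(quota={"ads": 8, "feed": 4})
a = Job("ads-ranker", "ads", workers=3, gpus_per_worker=2)      # needs 6 GPUs
b = Job("ads-embeddings", "ads", workers=2, gpus_per_worker=2)  # needs 4 GPUs
c = Job("feed-ranker", "feed", workers=2, gpus_per_worker=2)    # needs 4 GPUs
for job in (a, b, c):
    q.submit(job)

first = q.admit()    # b does not fit next to a, so it waits without holding GPUs
q.finish(a)
second = q.admit()   # b is admitted once a releases its reservation
print(first, second)
```

In this toy run the second "ads" job waits as a whole instead of grabbing the two GPUs left in its team's quota, which is the behaviour that avoids two half-started jobs blocking each other. The "feed" job is unaffected because it draws on a separate quota.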

Sources: