• Author: Cedric Blondeau
• TL;DR: We migrated Reddit’s Home Feed Ranker from CPU to GPU to unlock scalability and efficiency, and to enable further growth with new architectures such as Transformers.
• Outcomes include a 10x reduction in serving costs.
• Early research pointed to exponential efficiency gains with Transformer blocks.
• To get there, we 1) redesigned the model graph for GPU efficiency and 2) refactored the serving path to eliminate bottlenecks and feed the GPUs large batches.
• Background: At Reddit, we’ve been using GPUs to serve Transformer-like models for about a year. These were mostly LLMs or pre-trained models on the async path, which ran well on GPU out of the box.
• Meanwhile, our flagship consumer-side model, the Home Feed ranking model, continued running on CPU.
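The "feed the GPUs large batches" step can be made concrete with a small sketch. It shows one way a serving path might group ranking candidates into a few fixed batch sizes before sending them to the GPU. The bucket sizes, `PAD_ID`, and the function names are hypothetical, and the post does not describe Reddit's batching code. Padding to a small set of fixed shapes is one common approach, because GPU runtimes generally perform best with large, shape-stable inputs.

```python
from typing import List, Sequence, Tuple

# Hypothetical bucket sizes; real values would be tuned per model and GPU.
BUCKETS: Tuple[int, ...] = (64, 256, 1024)
PAD_ID = 0  # hypothetical id reserved for padding rows (real candidate ids start at 1)


def pick_bucket(n: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket that holds n items (the largest bucket if none does)."""
    for size in buckets:
        if n <= size:
            return size
    return buckets[-1]


def make_batches(
    candidate_ids: Sequence[int], buckets: Sequence[int] = BUCKETS
) -> List[Tuple[List[int], int]]:
    """Split candidates into batches whose lengths are always one of the bucket sizes.

    Full chunks use the largest bucket. The final partial chunk is padded up to
    the smallest bucket that fits it, so the GPU only ever sees a few shapes.
    Each entry is (padded_batch, number_of_real_candidates).
    """
    max_size = buckets[-1]
    batches = []
    for start in range(0, len(candidate_ids), max_size):
        chunk = list(candidate_ids[start:start + max_size])
        size = pick_bucket(len(chunk), buckets)
        padded = chunk + [PAD_ID] * (size - len(chunk))
        batches.append((padded, len(chunk)))
    return batches
```

With 1,100 candidates, this produces one full batch of 1,024 and one batch of 76 real candidates padded to 256. The caller would use the returned count to discard scores for the padded rows.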

Article Summaries:

  • Reddit has shifted its Home Feed ranking model from CPU to GPU, cutting serving costs by roughly tenfold and opening the door to newer architectures such as Transformers. The move required redesigning the model graph to run efficiently on GPUs. This included moving string tokenization into a separate preprocessing step and replacing CPU-bound embedding loops with Gather operations that run directly on the GPU. By minimizing host-to-device data transfers, fusing kernels, and employing CUDA Graphs, the team achieved higher GPU utilization and lower latency. The upgrade improves scalability for Reddit's personalized feed and also enables future experimentation with more complex deep-learning models.
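The two graph changes in the summary can be illustrated with a small, framework-free sketch. First, strings are turned into integer ids in a preprocessing step, so the model itself only receives integers. Second, a per-feature embedding lookup loop is collapsed into a single gather over one concatenated table. The table concatenation with per-table row offsets, the CRC32-hashed out-of-vocabulary buckets, and every name below are assumptions for illustration, not Reddit's implementation. In a real model, the final indexed read would correspond to one tensor op such as `tf.gather` or `torch.index_select` executed on the GPU.

```python
import zlib
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
NUM_OOV_BUCKETS = 4  # hypothetical: hash buckets for tokens missing from the vocab


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Preprocessing (runs before the model): map strings to integer ids.

    Unknown tokens are hashed with CRC32, which is deterministic across
    processes unlike Python's built-in hash(), into buckets placed after
    the vocabulary rows.
    """
    ids = []
    for tok in text.lower().split():
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            ids.append(len(vocab) + zlib.crc32(tok.encode("utf-8")) % NUM_OOV_BUCKETS)
    return ids


def lookup_loop(
    tables: Sequence[Sequence[Vector]], ids_per_feature: Sequence[Sequence[int]]
) -> List[Vector]:
    """Baseline: a separate small lookup for every feature and every id."""
    out = []
    for table, ids in zip(tables, ids_per_feature):
        for i in ids:
            out.append(list(table[i]))
    return out


def fuse_tables(tables: Sequence[Sequence[Vector]]) -> Tuple[List[Vector], List[int]]:
    """Concatenate all tables into one and record each table's starting row."""
    fused: List[Vector] = []
    offsets: List[int] = []
    for table in tables:
        offsets.append(len(fused))
        fused.extend(table)
    return fused, offsets


def lookup_gather(
    fused: Sequence[Vector], offsets: Sequence[int], ids_per_feature: Sequence[Sequence[int]]
) -> List[Vector]:
    """Fused path: shift ids by their table's offset, then do one gather."""
    flat_idx = [off + i for off, ids in zip(offsets, ids_per_feature) for i in ids]
    return [list(fused[j]) for j in flat_idx]  # the single gather over flat_idx
```

The design point is that the fused path performs one indexed read over one flat index list regardless of how many features there are. By contrast, the loop performs one small lookup per feature and per id.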

Sources: