• Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models (Part 1)

Authors: Xiao Yang | Senior Staff Machine Learning Engineer; Ang Xu | Principal Machine Learning Engineer; Yao Cheng | Senior Machine Learning Engineer; Yuanlu Bai | Machine Learning Engineer II; Yuan Wang | Machine Learning Engineer II; Sihan Wang | Staff Software Engineer; Ken Xuan | Senior Software Engineer

Introduction

In the world of large-scale recommendation systems, the “Two-Tower” model architecture has long been the industry standard for the retrieval and lightweight ranking stage. Its appeal lies in its elegant efficiency: one neural network tower encodes the user, another encodes the item, and at serving time, the ranking score is reduced to a simple dot product between two vectors. This architectural simplicity allows systems to scan millions of candidates in mere milliseconds, making it the workhorse of modern discovery engines.

However, this efficiency comes at a significant cost in expressiveness. The Two-Tower architecture inherently struggles to leverage interaction features: complex, high-fidelity signals that capture exactly how a specific user interacts with a specific item (e.g., “User A has clicked on an ad from Advertiser B five times in the last hour”). Furthermore, it prevents the use of powerful architectural patterns like target attention or early feature crossing, where user and candidate features interact deep within the network layers rath

Article Summaries:

  • A team of machine‑learning engineers at a major ad tech firm announced a redesign of its serving stack to support next‑generation lightweight ranking models that go beyond the traditional Two‑Tower architecture. The new system replaces the dot‑product‑only retrieval stage with a GPU‑based inference layer that can process complex cross‑features and deep interaction signals between users and items. Leveraging the company’s in‑house PyTorch‑compatible inference engine, the redesign integrates heavy neural models into the existing retrieval pipeline while maintaining end‑to‑end latency. The change aims to improve recommendation quality without sacrificing the speed that has made Two‑Tower models the industry standard.
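
The dot-product scoring that the article contrasts against can be illustrated with a minimal sketch. This is not the authors' implementation: the single-layer "towers", the feature sizes, the embedding dimension, and every function and variable name below are hypothetical, chosen only to show why the item side can be precomputed and why a per-pair interaction feature has no place to enter.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # hypothetical embedding size, chosen for illustration only


def user_tower(user_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy user tower: one linear layer plus tanh (a stand-in for a real network)."""
    return np.tanh(user_features @ weights)


def item_tower(item_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy item tower: sees item features only, so its output can be precomputed offline."""
    return np.tanh(item_features @ weights)


def two_tower_scores(user_vec: np.ndarray, item_vecs: np.ndarray) -> np.ndarray:
    """Score every candidate with a single matrix-vector product (one dot product per item)."""
    return item_vecs @ user_vec


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring candidates, best first."""
    return np.argsort(-scores)[:k]


# Hypothetical data: 1,000 candidate ads with 6 features each, one user with 4 features.
item_w = rng.normal(size=(6, EMBED_DIM))
user_w = rng.normal(size=(4, EMBED_DIM))
item_vecs = item_tower(rng.normal(size=(1000, 6)), item_w)  # computed ahead of time
user_vec = user_tower(rng.normal(size=4), user_w)           # computed once per request

scores = two_tower_scores(user_vec, item_vecs)
best = top_k(scores, k=5)

# A signal such as "clicks by this user on this advertiser in the last hour"
# depends on the (user, item) pair, so neither tower can consume it: each tower
# sees only its own side, and the two sides meet only in the final dot product.
```

In this sketch the per-request cost is one user-tower pass plus one matrix-vector product, which is the efficiency the article credits to the architecture; a model that consumes pair-level features would instead need a forward pass per candidate.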

Sources: