Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries

• Embedding model inference often struggles with efficiency when serving large volumes of short requests, a common pattern in search, retrieval, and recommendation systems.
• At Voyage AI by MongoDB, these short requests are called queries; all other requests are called documents.
• Queries must usually be served with very low latency (typically 100-300 ms).
• Queries are short, and their token-length distribution is highly skewed.
• As a result, query inference tends to be memory-bound rather than compute-bound.
• Query traffic is spiky, and autoscaling reacts too slowly to absorb the bursts.

Article Summaries:

  • Voyage AI’s new blog post explains why short “query” requests, common in search and recommendation systems, are served inefficiently by traditional batching. Because queries are brief and their token lengths are highly skewed, conventional padding wastes GPU memory and compute, which raises latency; query traffic is also spiky, so autoscaling cannot react quickly enough to compensate. The team uses the padding-removal features of inference engines such as vLLM and SGLang to concatenate active sequences into a single “super-sequence,” so inference time scales with total token count rather than padded length. They introduce token-count-based batching, which groups queries by cumulative token count instead of by fixed windows or request counts. The reported result is a 50% reduction in GPU inference latency while using one-third as many GPUs.
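
To make the batching idea concrete, here is a minimal Python sketch of grouping queries by cumulative token count. It is an illustration under stated assumptions, not Voyage AI's implementation: the `Query` class, the `batch_by_token_count` function, the `max_tokens` budget, and the two helper functions are all hypothetical names chosen for this example, and token counts are assumed to be known before batching. The helpers compare how many token slots a batch occupies with padding versus with sequences concatenated into one super-sequence.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Query:
    """Hypothetical request record; num_tokens is assumed to be precomputed."""
    text: str
    num_tokens: int


def batch_by_token_count(queries: Iterable[Query], max_tokens: int) -> Iterator[List[Query]]:
    """Greedily group queries so each batch's total token count stays within max_tokens.

    A single query longer than max_tokens is emitted as a batch of its own.
    """
    batch: List[Query] = []
    total = 0
    for q in queries:
        # Close the current batch if adding this query would exceed the budget.
        if batch and total + q.num_tokens > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(q)
        total += q.num_tokens
    if batch:
        yield batch


def padded_tokens(batch: List[Query]) -> int:
    """Token slots used when every sequence is padded to the longest one."""
    return len(batch) * max(q.num_tokens for q in batch)


def packed_tokens(batch: List[Query]) -> int:
    """Token slots used when sequences are concatenated without padding."""
    return sum(q.num_tokens for q in batch)


if __name__ == "__main__":
    lengths = [3, 5, 40, 2, 6, 4]  # skewed: mostly short, one long outlier
    queries = [Query(text=f"q{i}", num_tokens=n) for i, n in enumerate(lengths)]

    for b in batch_by_token_count(queries, max_tokens=16):
        print([q.num_tokens for q in b], "total tokens:", packed_tokens(b))

    print("padded slots for one big batch:", padded_tokens(queries))
    print("packed slots for one big batch:", packed_tokens(queries))
```

With a skewed length distribution like the one above, a single long query inflates the padded size of the whole batch, while the packed size depends only on the sum of the lengths. A production batcher would also need a policy for how long to wait before flushing a partially filled batch; that is outside the scope of this sketch.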

Sources: