• Expedia Group Technology - Engineering Colocating Input Partitions with Kafka Streams When Consuming Multiple Topics: Sub-Topology Matters! • Understanding how sub-topology design influences partition co-location Abstract At Expedia Group™, we handle large-scale, near real-time data processing. • We use Kafka as messaging broker and streaming technologies around Kafka to consume and publish messages. • Kafka Streams is one of those framework/library. • This article outlines the behavior of Kafka Streams when consuming from multiple topics (specifically two) with identical partition counts and similar key strategies. • It details our expectations regarding partition assignments, the unexpected distribution behavior we encountered in production, and the architectural changes we implemented to enforce partition colocation - critical for achieving caching efficiency in our use case.

Article Summaries:

  • Expedia Group’s engineering team discovered that Kafka Streams does not guarantee partition colocation across topics when those topics are consumed in separate sub‑topologies. They were processing two topics with identical partition counts and key strategies, expecting the same partition index on both topics to be handled by the same stream instance so a local Guava cache could avoid redundant external API calls. In production, identical keys were routed to different instances, breaking cache locality. The root cause was that each topic lived in its own sub‑topology, so Kafka Streams treated them as independent graphs. The fix was to replace the local cache with a distributed Kafka Streams state store, ensuring consistent key handling across topics.

Sources: