The So-fine Real-time ML Paradigm

• Etsy’s CodeMosaic hackathon fuels rapid prototyping of bold tech ideas across teams.
• The team tackled stateful and online ML to update models in near real time.
• Current pipelines rely on 40‑hour batch jobs, delaying weight updates.
• Goal: cut retraining costs by up to 45× and boost metrics by 20%, the gains reported in a 2021 Grubhub study.
• Work was scoped into three streams: real‑time data ingestion, incremental training, and production updates.
• The three‑day sprint required tight focus and cross‑functional collaboration.

Article Summaries:

Etsy’s internal hackathon, CodeMosaic, saw a team tackle the challenge of turning the company’s batch‑based ML pipelines into a real‑time, stateful learning system. The goal was to eliminate the 40‑hour lag between user actions and model updates by pulling data directly from the Beacon Kafka stream and feeding it into an incremental TensorFlow service. Over three days the team split into three sub‑teams: one built a real‑time data pipeline using Kafka SQL and Rivulet, another developed a consumer that loads, updates, and redeploys model weights, and a third planned evaluation metrics. The data pipeline successfully joined multiple sources, but it never wrote its output to a new topic, and the incremental learner remained incomplete. Even so, the effort demonstrated a clear path toward cost‑saving, near‑real‑time ML at Etsy.
Etsy’s internal hackathon, CodeMosaic, saw a team tackle the challenge of building a real‑time, stateful machine‑learning system. The goal was to enable incremental model updates, covering both training and online learning, to cut costs and improve performance, following a 2021 Grubhub study that reported up to 45× cost savings and 20% metric gains. The project was split into three streams: generating real‑time training data from Etsy’s Beacon Kafka stream, creating a TensorFlow‑based consumer that updates model weights on the fly, and evaluating the approach against existing batch pipelines. While the team succeeded in joining multiple data sources to produce live training data, full end‑to‑end deployment remained incomplete after the three‑day sprint.
During Etsy’s annual CodeMosaic hackathon, a team set out to create a real‑time machine‑learning platform that would enable stateful training and online model updates. The goal was to replace the existing 40‑hour batch pipeline with a system that could ingest user actions directly from the Beacon Kafka stream, incrementally update TensorFlow models in memory, and serve the updated weights immediately. Over three days the team built a data‑joining prototype using Kafka SQL and Rivulet, overcoming schema and serialization hurdles to produce live training data. While the incremental learning service was still under development, the project demonstrated a viable path toward reducing retraining costs and improving model freshness.
Etsy’s internal hackathon, CodeMosaic, saw a team tackle the challenge of adding stateful, online machine‑learning to the company’s existing batch‑based pipelines. The goal was to enable near‑real‑time model weight updates, potentially cutting training costs by up to 45× and boosting performance, as a 2021 Grubhub study suggested. The project split into three tracks: generating real‑time training data from the Beacon Kafka stream, building a TensorFlow‑based consumer that updates models incrementally, and evaluating the approach against current batch processes. While the team succeeded in joining multiple data sources to produce live training data, full integration and deployment remained incomplete after the three‑day sprint.
Etsy’s internal hackathon, CodeMosaic, saw a team attempt to build a stateful, online machine‑learning system. The goal was to replace 40‑hour batch training cycles with real‑time data ingestion from the Beacon Kafka stream, incremental model updates in TensorFlow, and performance evaluation against existing batch pipelines. Over three days the team split into three sub‑teams: one extracted and joined real‑time training data, another built a service to load, update, and serve models, and a third planned evaluation metrics. While the data‑joining team succeeded in generating live training data, the project did not reach full deployment, highlighting the technical hurdles of real‑time ML at scale.
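
The load, update, and redeploy loop that the summaries describe can be illustrated with a minimal sketch. It is not the team's implementation: a small pure‑Python logistic model stands in for the TensorFlow model, a plain iterable stands in for the Beacon Kafka stream, and every name here (OnlineLogisticModel, consume, publish_every) is hypothetical. It only shows the general shape of stateful, incremental training, where each incoming event adjusts the existing weights instead of triggering a full batch retrain.

```python
import math
from typing import Dict, Iterable, Iterator, List, Tuple

# One training event: (feature vector, binary label). Hypothetical shape.
Event = Tuple[List[float], int]


class OnlineLogisticModel:
    """Tiny logistic regression updated one event at a time (SGD)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.updates = 0

    def predict_proba(self, features: List[float]) -> float:
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features: List[float], label: int) -> None:
        """Apply a single gradient step for the log-loss on one event."""
        error = self.predict_proba(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        self.updates += 1

    def snapshot(self) -> Dict[str, object]:
        """Copy of the current state, standing in for a weight export."""
        return {"weights": list(self.weights), "bias": self.bias}

    @classmethod
    def from_snapshot(
        cls, state: Dict[str, object], learning_rate: float = 0.1
    ) -> "OnlineLogisticModel":
        """Rebuild a model from a snapshot, standing in for a weight load."""
        weights = list(state["weights"])  # type: ignore[arg-type]
        model = cls(len(weights), learning_rate)
        model.weights = weights
        model.bias = float(state["bias"])  # type: ignore[arg-type]
        return model


def consume(
    model: OnlineLogisticModel, events: Iterable[Event], publish_every: int
) -> Iterator[Dict[str, object]]:
    """Update the model per event and yield a snapshot every N updates.

    In a real system `events` would be a stream consumer and each yielded
    snapshot would be pushed to the serving layer; both are assumptions.
    """
    for features, label in events:
        model.update(features, label)
        if model.updates % publish_every == 0:
            yield model.snapshot()


if __name__ == "__main__":
    # Simulated stream: feature 0 signals a positive, feature 1 a negative.
    stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50
    learner = OnlineLogisticModel(n_features=2, learning_rate=0.5)
    for number, state in enumerate(consume(learner, stream, 25), start=1):
        print(f"snapshot {number}: {state}")
```

The publish_every parameter captures the main design trade‑off in such a loop: publishing weights after every event gives the freshest model but the most redeploys, while a larger interval moves back toward batch behavior.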
