Revisiting Compute Scaling

• Yelp’s node autoscaler evolved from Clusterman to AWS Karpenter to improve scaling efficiency.
• Clusterman managed ASG pools via setpoints balancing requested vs allocatable resources.
• Frequent scaling made setpoint tuning hard, causing cost or resource ratio trade‑offs.
• Clusterman sometimes deleted pods during node removal, leading to unschedulable workloads.
• Karpenter offers dynamic provisioning, better cost control, and simplified configuration.
• The migration aligns with Yelp’s goal to reduce operational overhead and improve cluster stability.

Article Summaries:

Yelp’s Site Reliability team announced a shift from its internally developed Clusterman autoscaler to AWS Karpenter for Kubernetes node scaling. Clusterman, originally built for Mesos and later adapted for Kubernetes, managed node pools through Auto Scaling Groups (ASGs) and maintained a setpoint, a target ratio of requested to allocatable resources. The team ran into several problems with this design: setpoints that were hard to tune and keep stable, cyclical scaling that left pods unschedulable, difficulty matching pods to the specific instance types they needed, slow interval‑based scaling, and limited flexibility for custom node recycling. Karpenter offers faster, more dynamic scaling, better workload‑aware instance selection, and reduced operational overhead, which prompted Yelp’s migration.
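
To make the setpoint idea concrete, here is a minimal sketch of how a setpoint-based autoscaler could decide on a target pool size. It is not Clusterman's actual code: the names `PoolState` and `target_allocatable`, the use of CPUs as the only resource, and the `margin` tolerance band are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of one node pool (CPU only, for simplicity)."""
    requested_cpus: float    # sum of pod CPU requests scheduled in the pool
    allocatable_cpus: float  # sum of allocatable CPUs across the pool's nodes


def target_allocatable(state: PoolState, setpoint: float, margin: float = 0.05) -> float:
    """Return the allocatable capacity that brings requested/allocatable to the setpoint.

    `margin` is an assumed tolerance band: while the current ratio stays
    within setpoint +/- margin, the pool is left unchanged.
    """
    if not 0 < setpoint <= 1:
        raise ValueError("setpoint must be in (0, 1]")

    # An empty pool has no ratio to compare; size it directly from the requests.
    if state.allocatable_cpus <= 0:
        return state.requested_cpus / setpoint

    ratio = state.requested_cpus / state.allocatable_cpus
    if abs(ratio - setpoint) <= margin:
        return state.allocatable_cpus  # close enough, no scaling action

    # Outside the band: resize so that requested / allocatable == setpoint.
    return state.requested_cpus / setpoint


if __name__ == "__main__":
    # Ratio 0.9 against a setpoint of 0.5: the pool needs to grow.
    print(target_allocatable(PoolState(90, 100), setpoint=0.5))
    # Ratio 0.2 against a setpoint of 0.5: the pool can shrink.
    print(target_allocatable(PoolState(20, 100), setpoint=0.5))
```

The sketch also hints at why tuning is hard: a lower setpoint leaves more headroom but pays for idle capacity, while a higher one saves money but leaves less room for new pods between scaling intervals.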

Sources:

https://engineeringblog.yelp.com/2024/12/revisiting-compute-scaling.html