Efficient Visual Representation Learning And Evaluation

• Etsy leverages computer vision to enhance shopping via visual search and recommendations.
• Deep neural networks encode images into vector representations, powering downstream tasks at scale.
• Training and serving large models is costly; offline training with precomputed embeddings reduces runtime load.
• Efficiency matters across architecture, training, evaluation, and serving to meet low‑latency and memory constraints.
• The EfficientNet family offers scalable CNNs that balance width, depth, and resolution for a given resource budget.
• Trials began with EfficientNetB0, then scaled up for higher accuracy as resources allowed.
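
The balance of width, depth, and resolution mentioned in the highlights above can be illustrated with EfficientNet's compound scaling rule. The sketch below is a minimal illustration, not Etsy's configuration: the base coefficients (1.2, 1.1, 1.15) are the values reported in the EfficientNet paper rather than anything stated here, the function names are hypothetical, and the released B1 to B7 variants use hand-rounded settings, so the numbers are approximate.

```python
# Compound scaling sketch. ALPHA, BETA, GAMMA are the base coefficients
# reported in the EfficientNet paper; the function names are illustrative only.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for scaling exponent phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    # phi = 0 corresponds to the B0 baseline; larger phi means a bigger model.
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Under these coefficients each unit increase in phi roughly doubles the compute, since 1.2 * 1.1^2 * 1.15^2 is about 1.92. That is why starting from the B0 baseline and scaling up only as resources allow is a natural path.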

Article Summaries:

Etsy is advancing its visual search and recommendation systems by developing efficient image‑representation models. The company initially used EfficientNetB0, a lightweight CNN that delivered good accuracy with low latency. Seeking stronger representations, Etsy experimented with Vision Transformers (ViT), which offer superior performance on large datasets but are compute‑heavy. To balance accuracy and efficiency, Etsy adopted the newer EfficientFormer‑l3 architecture, which combines CNN downsampling with transformer attention only in the final stage, reducing parameters and inference time. This approach allows Etsy to train and serve high‑quality visual embeddings at scale while maintaining the low latency required for real‑time user interactions.
Etsy is upgrading its visual search and recommendation systems by adopting more efficient deep‑learning models. After successful trials with the lightweight EfficientNetB0, the team switched to the newer EfficientFormer‑l3, a hybrid vision transformer that reduces compute and memory usage while improving representation quality. The new architecture downsamples in its early blocks, applies attention only in the final stage, and aggregates the multi‑head outputs into image embeddings. This shift aims to lower inference latency and resource costs, enabling real‑time visual queries at scale. The initiative reflects Etsy’s broader focus on balancing model accuracy with deployment efficiency across training, evaluation, and serving pipelines.
Etsy is upgrading its visual search and recommendation systems by moving from lightweight CNNs to an efficient hybrid of CNN and Vision‑Transformer components. After initial success with EfficientNet‑B0, the company tested Vision Transformers (ViT) for richer representations but found them too slow for real‑time use. It then adopted the newer EfficientFormer‑l3 architecture, which blends CNN down‑sampling with selective attention to keep inference fast while improving accuracy. Models are trained offline and pre‑computed embeddings are served online, reducing latency and memory usage. This shift aims to deliver higher‑quality visual search results without incurring prohibitive infrastructure costs.
Etsy is upgrading its visual search and recommendation systems by adopting more efficient deep‑learning models for image representation. After initial success with the lightweight EfficientNetB0, the team switched to the newer EfficientFormer‑l3 architecture, which blends convolutional downsampling with transformer‑style attention only in the final stage. This design reduces the number of parameters and computational load while maintaining strong performance, addressing the high latency and memory demands of traditional Vision Transformers. The move aims to deliver faster, scalable visual search and similarity recommendations across Etsy’s diverse marketplace, improving user experience without incurring prohibitive infrastructure costs.
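
To see why restricting attention to the final stage lowers the computational load, it helps to count tokens: self-attention builds a score matrix whose size grows with the square of the token count. The sketch below is back-of-the-envelope arithmetic under stated assumptions, none of which come from the summaries: a 224x224 input, a plain ViT with 16x16 patches (ignoring any class token), and a hybrid model whose final stage runs at 1/32 of the input resolution. The function names are hypothetical.

```python
def num_tokens(image_size, stride):
    """Number of spatial tokens when each side is reduced by `stride`."""
    side = image_size // stride
    return side * side


def attention_matrix_entries(tokens):
    """Entries in one head's token-by-token attention score matrix."""
    return tokens * tokens


if __name__ == "__main__":
    image_size = 224                            # assumed input resolution
    vit_tokens = num_tokens(image_size, 16)     # assumed ViT patch size of 16
    hybrid_tokens = num_tokens(image_size, 32)  # assumed 1/32 final stage

    vit_entries = attention_matrix_entries(vit_tokens)
    hybrid_entries = attention_matrix_entries(hybrid_tokens)

    print(f"ViT tokens: {vit_tokens}, score-matrix entries: {vit_entries}")
    print(f"Hybrid final-stage tokens: {hybrid_tokens}, "
          f"score-matrix entries: {hybrid_entries}")
    print(f"Reduction per attention layer: {vit_entries // hybrid_entries}x")
```

Under these assumptions each attention layer works on 49 tokens instead of 196, a 16x smaller score matrix, and a plain ViT pays the larger cost in every layer, not just the last stage.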
Etsy is upgrading its visual search and recommendation systems by adopting more efficient deep‑learning models. After initial success with the lightweight EfficientNet‑B0, the team switched to the newer EfficientFormer‑l3, a hybrid vision transformer that uses fewer attention layers and relies on downsampling to keep inference fast and memory‑light. This shift aims to balance high‑quality image embeddings with the low latency required for real‑time product search. The change reflects Etsy’s broader strategy of training large models offline, pre‑computing representations, and deploying them online to improve user experience while managing infrastructure costs.
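
The offline/online split described in these summaries can be sketched as a lookup over precomputed vectors. This is a minimal illustration, not Etsy's serving stack: the item IDs, the three-dimensional vectors, and the function names are all made up for the example, and a production system would typically replace the brute-force scan with an approximate nearest-neighbor index.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(embeddings):
    """Offline step: L2-normalize every precomputed embedding once."""
    return {item_id: normalize(vec) for item_id, vec in embeddings.items()}


def search(index, query, k=2):
    """Online step: rank items by cosine similarity to the query embedding."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    # Hypothetical catalog: in practice these vectors would come from the
    # image model, computed in a batch job rather than at request time.
    catalog = {
        "mug": [0.9, 0.1, 0.0],
        "teapot": [0.8, 0.3, 0.1],
        "scarf": [0.0, 0.2, 0.9],
    }
    index = build_index(catalog)
    print(search(index, [1.0, 0.2, 0.0], k=2))
```

Because the catalog vectors are normalized ahead of time, the online path only has to embed the query image and compute dot products, which is what keeps request-time work small.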

Sources: