Training Design for Text-to-Image Models: Lessons from Ablations

Welcome back! This is the second part of our series on training efficient text-to-image models from scratch. In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model PRX. We also released an early, small (1.2B parameters) version of the model as a preview of what we are building (go try it if you haven’t already 😉). In this post, we shift our focus from architecture to training.

Article Summaries:

The authors continue their series on building a competitive text-to-image foundation model from scratch, shifting focus from architecture to training efficiency. They establish a clean baseline: a 1.2-billion-parameter PRX model trained with Flow Matching on a 1M-sample synthetic dataset, with hyperparameters that include 100k training steps, 256×256 resolution, and the AdamW optimizer. The post documents a set of recent training tricks (for example, learning-rate schedules, regularization, and auxiliary objectives), each tested in the same setup to assess its impact on convergence and representation quality. In future work, the authors plan to release the full training recipe and a combined “speedrun” configuration, and they invite community feedback via Discord.
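To make the baseline more concrete, here is a minimal sketch in plain Python. The numeric values in the config are the ones quoted in the summary above, but the field names, the helper functions, and the time convention (t = 0 is pure noise, t = 1 is clean data) are assumptions made for illustration and are not taken from the PRX codebase. It shows the linear-path form of Flow Matching, where the model is trained to predict the constant velocity that carries a noise sample to a data sample.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    # Values are those quoted in the summary; field names are hypothetical.
    model_params: float = 1.2e9
    dataset_size: int = 1_000_000
    train_steps: int = 100_000
    resolution: int = 256
    optimizer: str = "AdamW"
    objective: str = "flow_matching"


def flow_matching_pair(data, noise, t):
    """Return (x_t, target_velocity) for a straight noise-to-data path.

    Assumed convention: t = 0 gives pure noise, t = 1 gives clean data.
    """
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    velocity = [d - n for d, n in zip(data, noise)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity, one sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()  # uniform timestep in [0, 1)
    x_t, target = flow_matching_pair(data, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in "model" that always predicts zero velocity.
    zero_model = lambda x_t, t: [0.0 for _ in x_t]
    print(BaselineConfig())
    print(flow_matching_loss(zero_model, [0.5, -0.25, 1.0], rng))
```

In a real training loop the lists would be image or latent tensors and `predict_velocity` would be the network being optimized; the sketch only illustrates how the interpolated input and the regression target are built.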

Sources:

https://huggingface.co/blog/Photoroom/prx-part2