Community Evals: Because we're done trusting black-box leaderboards over the community

• Headline benchmarks are saturated (MMLU above 91%, GSM8K above 94%), yet models that score well on them still fail real-world tasks.
• Scores for the same benchmark differ across papers, model cards, and platforms, so there is no single source of truth.
• The Hugging Face Hub introduces decentralized reporting, letting the community submit scores via pull requests.
• Dataset repos become benchmarks, automatically aggregating results and displaying leaderboards on their dataset cards.
• Models store eval scores in .eval_results/*.yaml; the scores are visible on model cards and feed into the leaderboards.
• Verified badges signal reproducibility, narrowing the gap between leaderboard claims and community trust.

Article Summaries:

The Hugging Face Hub is launching a decentralized evaluation system to address gaps in AI benchmark reporting. Models will store their own evaluation results in .eval_results/*.yaml files, which appear on model cards and feed into dataset-based leaderboards. Dataset repositories can register as benchmarks (e.g., MMLU-Pro, GPQA, HLE), automatically aggregating scores from across the Hub and displaying them on the dataset card. Community members can submit results via pull requests, linking to papers or logs as evidence, and discuss scores openly. The goal is to expose reproducible, community-verified scores through Hub APIs, improving transparency and reducing the disconnect between benchmark scores and real-world performance.
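The aggregation step described above can be pictured with a small Python sketch. It is an illustration only and does not use the Hub's actual code or file schema: the EvalResult fields, the repo names, and the rule of preferring verified results over unverified ones are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One reported score. All field names are hypothetical, not the Hub schema."""
    model_id: str    # model repo that reports the score
    benchmark: str   # dataset repo acting as the benchmark
    score: float     # reported metric value
    verified: bool   # whether the result carries a verified badge
    source: str      # pointer to a paper or evaluation log


def build_leaderboards(results):
    """Group results by benchmark and rank models by score, highest first.

    If a model reports several results for one benchmark, keep a verified
    result over an unverified one, then the higher score (assumed policy).
    """
    best = defaultdict(dict)  # benchmark -> {model_id: EvalResult}
    for r in results:
        current = best[r.benchmark].get(r.model_id)
        if current is None or (r.verified, r.score) > (current.verified, current.score):
            best[r.benchmark][r.model_id] = r
    return {
        benchmark: sorted(per_model.values(), key=lambda r: r.score, reverse=True)
        for benchmark, per_model in best.items()
    }


# Made-up sample data; the repo names and scores are placeholders.
sample_results = [
    EvalResult("org/model-a", "example/benchmark-x", 71.2, False, "paper"),
    EvalResult("org/model-b", "example/benchmark-x", 68.5, True, "logs"),
    EvalResult("org/model-a", "example/benchmark-x", 69.9, True, "logs"),
    EvalResult("org/model-a", "example/benchmark-y", 40.0, False, "paper"),
]

leaderboards = build_leaderboards(sample_results)
for benchmark, rows in leaderboards.items():
    print(benchmark)
    for rank, row in enumerate(rows, start=1):
        badge = "verified" if row.verified else "unverified"
        print(f"  {rank}. {row.model_id}: {row.score} ({badge}, source: {row.source})")
```

Keeping the verified flag and the source next to each score mirrors the idea that a leaderboard entry should be traceable to its evidence. How the Hub actually resolves duplicate or conflicting submissions is not stated in the summary, so the selection rule here should be read as one possible choice.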

Sources:

https://huggingface.co/blog/community-evals