Community Evals: Because we're done trusting black-box leaderboards over the community
• Benchmark scores have saturated (MMLU >91%, GSM8K >94%), yet models still fail on real-world tasks.
• Inconsistent benchmark scores across papers, model cards, and platforms create no single t