• The INDUCTION benchmark tests finite-structure concept synthesis in first-order logic across small relational worlds.
• Models output a single logical formula that uniformly explains the labeled target predicates in all worlds.
• Three regimes, FullObs, Contrastive (CI), and Existential Completion (EC), evaluate generalization; the benchmark also penalizes formula bloat.
• Experiments reveal sharp difficulty gradients and persistent hard structural families; low-bloat formulas generalize better on unseen worlds.
• Recent top-performing models show distinct behaviors across tasks, hinting at varied concept-generalization strategies.
Article Summaries:
- Summary
A new benchmark, INDUCTION, has been released to evaluate finite-structure concept synthesis in first-order logic. Each task presents several small relational worlds with labeled target predicates and requires the model to produce a single logical formula that uniformly explains the target across all worlds; correctness is verified by exact model checking. The benchmark includes three regimes: FullObs, Contrastive (CI), and Existential Completion (EC). It also penalizes formula bloat. Experiments reveal sharp difficulty gradients and persistent hard structural families, and low-bloat formulas generalize better on unseen worlds. Recent top-performing models exhibit distinct behaviors across tasks and metrics, suggesting varied strategies for concept generalization.
Sources: