• ABD benchmark tests default‑exception abduction in finite first‑order logical worlds.
• Models generate sparse exception formulas to restore satisfiability under abnormality predicates.
• Three observation regimes (closed‑world, existential completion, universal completion) are formally defined.
• Exact SMT verification ensures rigorous evaluation of candidate exception definitions.
• Ten leading LLMs evaluated on 600 instances, achieving high validity.
• Parsimony gaps and regime‑specific generalization failures highlight future research challenges.
Article Summaries:
- A new benchmark, ABD: Default Exception Abduction in Finite First‑Order Worlds, was introduced to evaluate how large language models (LLMs) handle logical reasoning with abnormalities. The benchmark supplies a background theory that includes an abnormality predicate and a set of relational structures, and asks models to produce a first‑order formula that defines sparse exceptions, thereby restoring satisfiability. Three observation regimes (closed‑world, existential completion, and universal completion) are formalized, and candidate exception definitions are verified exactly with SMT solvers. In tests on 600 instances, ten leading LLMs achieved high validity, yet gaps in parsimony remained, and hold‑out evaluations highlighted distinct generalization failures across the regimes.
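
To make the task concrete, here is a minimal sketch of the idea of scoring a candidate exception definition for validity and parsimony. It is not the benchmark's actual instance format or its SMT-based verifier: the toy world, the predicate names (`Bird`, `Penguin`, `Flies`), the single default rule, and the helper functions are all hypothetical. It also assumes a fully observed world and checks the rule by plain enumeration rather than with a solver.

```python
# Hypothetical toy world: a finite domain with unary predicates stored as sets.
world = {
    "domain": {"tweety", "opus", "rex"},
    "Bird": {"tweety", "opus"},
    "Penguin": {"opus"},
    "Flies": {"tweety"},
}


def default_holds(world, ab):
    """Check the assumed default: forall x. Bird(x) and not Ab(x) -> Flies(x)."""
    return all(
        x in world["Flies"]
        for x in world["domain"]
        if x in world["Bird"] and not ab(world, x)
    )


def evaluate(world, ab):
    """Return (valid, cost): whether the default holds, and how many
    domain elements the candidate marks as abnormal."""
    valid = default_holds(world, ab)
    cost = sum(1 for x in world["domain"] if ab(world, x))
    return valid, cost


# Three candidate definitions of the abnormality predicate Ab(x).
candidates = {
    "no_exceptions": lambda w, x: False,            # Ab(x) := false
    "penguins": lambda w, x: x in w["Penguin"],     # Ab(x) := Penguin(x)
    "all_birds": lambda w, x: x in w["Bird"],       # Ab(x) := Bird(x)
}

results = {name: evaluate(world, ab) for name, ab in candidates.items()}

for name, (valid, cost) in results.items():
    print(f"{name}: valid={valid}, abnormal_count={cost}")
```

In this toy setting, declaring nothing abnormal leaves the default violated, while marking every bird as abnormal satisfies it but flags more elements than needed. That difference between a valid definition and a sparse one is the kind of gap the summary refers to as parsimony.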
Sources: