Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

• AI analysts built on large language models reproduce the diversity of many‑analyst studies at scale.
• Different LLMs and prompt framings (personas) generate distinct analytic pipelines on the same dataset.
• An AI auditor screens runs for methodological validity, filtering out flawed analyses.
• Across three datasets, effect sizes, p‑values, and hypothesis decisions vary widely, sometimes reversing support.
• Variation is structured by preprocessing, model choice, and inference decisions, which are in turn linked to the LLM and persona.
• Outcomes are steerable: changing the persona or LLM shifts result distributions even after filtering.
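To make the audit-then-compare workflow concrete, here is a minimal Python sketch. It assumes each AI-analyst run can be reduced to an effect estimate, a p-value, and a pass/fail audit verdict. The `AnalystRun` record, the persona labels, the 0.05 threshold, and the "significant and positive" decision rule are all hypothetical, and the toy runs are invented for illustration rather than taken from the paper. The sketch drops runs the auditor rejects, then reports the share of the remaining runs that support the hypothesis, grouped by persona or by LLM.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

ALPHA = 0.05  # assumed significance threshold; the summary does not state one


@dataclass(frozen=True)
class AnalystRun:
    """One hypothetical AI-analyst run on the shared dataset."""
    llm: str
    persona: str
    effect: float       # estimated effect size
    p_value: float
    audit_passed: bool  # AI auditor's verdict, modelled here as a plain boolean


def supports_hypothesis(run: AnalystRun, alpha: float = ALPHA) -> bool:
    """Assumed decision rule: significant effect in the hypothesised (positive) direction."""
    return run.p_value < alpha and run.effect > 0


def support_rate_by(runs: Iterable[AnalystRun],
                    key: Callable[[AnalystRun], str]) -> Dict[str, float]:
    """Share of audit-passing runs that support the hypothesis, grouped by key(run)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [supporting, total]
    for run in runs:
        if not run.audit_passed:
            continue  # the auditor screens out methodologically flawed runs
        tally = counts[key(run)]
        tally[0] += supports_hypothesis(run)
        tally[1] += 1
    return {group: s / total for group, (s, total) in counts.items()}


# Toy runs for illustration only; these are not results from the paper.
runs = [
    AnalystRun("llm-a", "skeptic", 0.02, 0.40, True),
    AnalystRun("llm-a", "skeptic", -0.10, 0.03, True),   # significant, wrong sign
    AnalystRun("llm-a", "advocate", 0.25, 0.01, True),
    AnalystRun("llm-b", "advocate", 0.30, 0.002, True),
    AnalystRun("llm-b", "advocate", 0.50, 0.0001, False),  # rejected by the auditor
    AnalystRun("llm-b", "skeptic", 0.15, 0.04, True),
]

print(support_rate_by(runs, key=lambda r: r.persona))
# → {'skeptic': 0.3333333333333333, 'advocate': 1.0}
print(support_rate_by(runs, key=lambda r: r.llm))
# → {'llm-a': 0.3333333333333333, 'llm-b': 1.0}
```

The second toy run shows how a significant estimate with the opposite sign counts against the hypothesis, the kind of reversal described above. Swapping the grouping key is one simple way to check whether persona or LLM shifts the distribution of decisions after filtering.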

Article Summaries:

Researchers demonstrate that large language models (LLMs) can act as autonomous “AI analysts” that generate diverse statistical analyses from a single dataset. Each AI analyst is assigned an LLM and a prompt “persona” and asked to test a pre‑specified hypothesis; it then constructs and runs a full analysis pipeline, while an AI auditor screens the runs for methodological soundness. Across three experimental and observational datasets, the resulting effect sizes, p‑values, and binary decisions about hypothesis support vary widely, often reversing conclusions. The variation is systematic, tied to preprocessing, model specification, and inference choices, and it can be steered by changing the analyst persona or the underlying LLM. The approach offers a scalable, cost‑effective alternative to traditional many‑analyst studies.
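The summary attributes the spread of results to preprocessing, model specification, and inference choices. Below is a minimal Python sketch of that mechanism. It uses a conventional multiverse (specification-curve) analysis rather than the paper's LLM-driven pipeline. One simulated, confounded dataset is analysed under every combination of three binary choices. The first is preprocessing: keep all rows, or trim outcome outliers. The second is model specification: with or without the covariate. The third is inference: a two-sided or one-sided test. The simulated data, the 2-SD trimming rule, the normal-approximation p-values, and every function name are assumptions for illustration.

```python
import math
import random
from itertools import product
from statistics import fmean, pstdev


def simulate(n=200, seed=1):
    """Hypothetical observational data in which treatment t is confounded by covariate x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        t = 1 if x + rng.gauss(0, 1) > 0 else 0  # units with high x are treated more often
        y = 0.1 * t + 0.8 * x + rng.gauss(0, 1)   # assumed true effect of t is 0.1
        rows.append((t, x, y))
    return rows


def trim(rows, k=2.0):
    """Preprocessing option: drop rows whose outcome lies more than k SDs from the mean."""
    ys = [y for _, _, y in rows]
    m, sd = fmean(ys), pstdev(ys)
    return [r for r in rows if abs(r[2] - m) <= k * sd]


def residualize(v, x=None):
    """Remove the mean of v and, if x is given, its linear trend in x."""
    mv = fmean(v)
    if x is None:
        return [b - mv for b in v]
    mx = fmean(x)
    slope = sum((a - mx) * (b - mv) for a, b in zip(x, v)) / sum((a - mx) ** 2 for a in x)
    return [b - mv - slope * (a - mx) for a, b in zip(x, v)]


def estimate(rows, adjust):
    """OLS coefficient on t and its z statistic, via Frisch-Waugh-Lovell residualization."""
    t, x, y = (list(col) for col in zip(*rows))
    covariate = x if adjust else None
    t_r, y_r = residualize(t, covariate), residualize(y, covariate)
    sxx = sum(a * a for a in t_r)
    beta = sum(a * b for a, b in zip(t_r, y_r)) / sxx
    df = len(rows) - (3 if adjust else 2)  # intercept, t, and optionally x
    sigma2 = sum((b - beta * a) ** 2 for a, b in zip(t_r, y_r)) / df
    return beta, beta / math.sqrt(sigma2 / sxx)


def p_value(z, test):
    """Normal-approximation p-value, two-sided or one-sided for H1: effect > 0."""
    if test == "two_sided":
        return math.erfc(abs(z) / math.sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))


def multiverse(rows, alpha=0.05):
    """Cross preprocessing, model-specification, and inference choices on the same data."""
    results = []
    for prep, adjust, test in product(("none", "trim"), (False, True), ("two_sided", "one_sided")):
        data = trim(rows) if prep == "trim" else rows
        beta, z = estimate(data, adjust)
        p = p_value(z, test)
        results.append({"prep": prep, "adjust": adjust, "test": test,
                        "effect": round(beta, 3), "p": round(p, 4),
                        "supports": p < alpha and beta > 0})
    return results


for spec in multiverse(simulate()):
    print(spec)
```

Under this assumed data-generating process, omitting the confounder inflates the estimate well above the simulated 0.1. The unadjusted specifications should therefore all support the hypothesis, while the adjusted ones need not. The test direction and the trimming rule can also push borderline p-values across the threshold.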

Sources:

https://arxiv.org/abs/2602.18710