DMCD: Semantic-Statistical Framework for Causal Discovery

Computer Science > Artificial Intelligence [Submitted on 23 Feb 2026]

Abstract: We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs.
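The Phase II audit described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes roughly linear-Gaussian data, uses a Fisher z partial-correlation test as the conditional independence test, and conditions each draft edge on the child's other draft parents. The hard-coded draft, the function names, and the threshold alpha are all hypothetical stand-ins, and the LLM drafting step of Phase I is not shown.

```python
import math
import numpy as np

def fisher_z_pvalue(data, i, j, cond):
    """Two-sided p-value for H0: column i is independent of column j given cond.

    Uses the partial correlation read off the inverse correlation matrix,
    followed by the Fisher z transform (assumes roughly Gaussian data).
    """
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def audit_draft(data, draft_edges, alpha=0.01):
    """Split draft edges into those the data supports and those it questions.

    Assumption of this sketch: an edge u -> v is tested given the other
    draft parents of v, and flagged when independence is not rejected.
    """
    parents = {}
    for u, v in draft_edges:
        parents.setdefault(v, set()).add(u)
    kept, flagged = [], []
    for u, v in draft_edges:
        cond = sorted(parents[v] - {u})
        p = fisher_z_pvalue(data, u, v, cond)
        (kept if p < alpha else flagged).append((u, v))
    return kept, flagged

# Synthetic chain X0 -> X1 -> X2; the draft adds a questionable edge X0 -> X2.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = 1.5 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

draft = [(0, 1), (1, 2), (0, 2)]  # hypothetical stand-in for an LLM draft
kept, flagged = audit_draft(data, draft)
print("kept:", kept)
print("flagged for revision:", flagged)
```

In this toy setup the true chain edges carry strong dependence and are kept, while the direct X0 -> X2 edge has no support once X1 is conditioned on, so it would usually be flagged. How DMCD actually chooses conditioning sets and turns discrepancies into edge revisions is not specified in the abstract.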

Article Summaries:

DMCD: Semantic‑Statistical Framework for Causal Discovery

DMCD (DataMap Causal Discovery) is a new two‑phase causal‑discovery method that combines large‑language‑model (LLM) semantic reasoning with statistical validation. In Phase I, an LLM drafts a sparse directed acyclic graph (DAG) from variable metadata, providing a semantically informed prior. Phase II audits this draft with conditional‑independence tests on observational data and revises edges where the draft conflicts with the statistical evidence. On three metadata‑rich real‑world benchmarks (industrial engineering, environmental monitoring, and IT systems analysis), DMCD is competitive with or outperforms a range of causal discovery baselines, with the largest gains in recall and F1 score. Probing and ablation experiments suggest that the gains come from semantic reasoning over metadata rather than memorization of benchmark graphs, which supports pairing semantic priors with statistical checks as a practical approach to causal discovery.
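For readers unfamiliar with how recall and F1 are typically computed for a learned graph, here is a minimal sketch. It assumes the metrics are taken over sets of directed edges; the summary does not say whether the paper scores directed edges or the undirected skeleton. The variable names and both edge sets are hypothetical.

```python
def edge_scores(true_edges, predicted_edges):
    """Precision, recall, and F1 over sets of directed edges."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    tp = len(true_set & pred_set)  # edges present in both graphs
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical ground truth and prediction, for illustration only.
true_graph = [("load", "temp"), ("temp", "fan"), ("fan", "noise"), ("temp", "alarm")]
predicted = [("load", "temp"), ("temp", "fan"), ("load", "alarm")]

precision, recall, f1 = edge_scores(true_graph, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted edges are correct and two of the four true edges are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7. A method that proposes more correct edges raises recall, which is the kind of gain the summary reports.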

Sources:

https://arxiv.org/abs/2602.20333 (Latest source article published: 2026-02-25 05:00 UTC)