Intent Laundering: AI Safety Datasets Are Not What They Seem

Submitted on 17 Feb 2026 (Computer Science > Cryptography and Security)

Abstract:

• We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
• In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution.
• We find that these datasets overrely on “triggering cues”: words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
• In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues.
• To explore this, we introduce “intent laundering”: a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details.
• Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues.
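The idea of a dataset "overrelying on triggering cues" can be made concrete with a rough measurement: the fraction of data points that contain at least one overtly negative word. The Python sketch below is an illustration only, not the procedure used in the paper. The cue lexicon `TRIGGERING_CUES`, the helper names `contains_cue` and `cue_rate`, and the toy prompts are all hypothetical and chosen for this example.

```python
import re
from typing import Iterable

# Hypothetical cue lexicon for illustration; NOT taken from the paper.
# Entries are assumed to be lowercase single words.
TRIGGERING_CUES = {"illegal", "malicious", "harmful", "dangerous"}


def contains_cue(prompt: str, cues: Iterable[str] = TRIGGERING_CUES) -> bool:
    """Return True if any cue appears in the prompt as a whole word."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return any(cue in words for cue in cues)


def cue_rate(prompts: list[str], cues: Iterable[str] = TRIGGERING_CUES) -> float:
    """Fraction of prompts that contain at least one cue (0.0 for an empty list)."""
    if not prompts:
        return 0.0
    cue_set = set(cues)
    return sum(contains_cue(p, cue_set) for p in prompts) / len(prompts)


# Toy data points, invented for this example.
toy_prompts = [
    "Describe a harmful prank to play on a coworker.",
    "Summarize this quarterly report for my manager.",
    "What illegal shortcuts do people take on taxes?",
    "Draft a polite reminder email about an overdue invoice.",
]

print(cue_rate(toy_prompts))  # 0.5: two of the four prompts contain a cue
```

A high rate on a safety dataset would only suggest, under this crude word-matching assumption, that refusals on it could be driven by surface wording rather than by the underlying intent; a real analysis would need a far more careful definition of a cue.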

Sources:

https://arxiv.org/abs/2602.16729