How Reddit Built a LLM Guardrails Platform

• Written by Charan Akiri, with help from Dylan Raithel.
• TL;DR: We built a centralized LLM Guardrails Service at Reddit to detect and block malicious and unsafe inputs, including prompt injection, jailbreak attempts, harassment, NSFW, and violent content, before they reach downstream language models.
• The service operates as a first-line security and safety boundary, returning per-category risk scores and enforcement signals through configurable, client-specific policies.
• Today, the system achieves an F1 score of 0.97 with sub-25 ms p99 latency and fully enforces blocking in production across major Reddit products.

Why Did We Build This?

• In 2024 we observed a sharp acceleration in LLM adoption across Reddit's products and internal tooling.
• Adoption quickly moved from experiments to mission-critical Reddit assets and flagship products.

Article Summaries:

Reddit has launched a centralized LLM Guardrails Service to detect and block malicious or unsafe inputs, such as prompt injections, jailbreak attempts, harassment, NSFW, and violent content, before they reach downstream language models. The system, deployed across major Reddit products, returns per-category risk scores and enforcement signals via configurable, client-specific policies. It achieves an F1 score of 0.97 with sub-25 ms p99 latency and fully enforces blocking in production. The platform was created because existing foundation-model guardrails failed to address Reddit's unique threat surface, which includes high-volume, linguistically diverse user content and frequent technical queries that triggered false positives.
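To make "per-category risk scores and enforcement signals via configurable, client-specific policies" concrete, here is a minimal sketch of how such a policy check could look. It is not Reddit's implementation: the names `ClientPolicy`, `Decision`, and `evaluate`, the category keys, the client IDs, and every threshold value are hypothetical, and the scores are hard-coded where a real service would call a classifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClientPolicy:
    """Hypothetical per-client policy: a block threshold for each category.

    A category missing from the map is scored but never enforced.
    """
    client_id: str
    block_thresholds: Dict[str, float] = field(default_factory=dict)


@dataclass
class Decision:
    """Scores plus the enforcement signal derived from one client's policy."""
    scores: Dict[str, float]
    violations: List[str]

    @property
    def blocked(self) -> bool:
        # Block as soon as at least one category crossed its threshold.
        return bool(self.violations)


def evaluate(scores: Dict[str, float], policy: ClientPolicy) -> Decision:
    """Compare each category score with the client's threshold for it."""
    violations = sorted(
        category
        for category, threshold in policy.block_thresholds.items()
        if scores.get(category, 0.0) >= threshold
    )
    return Decision(scores=scores, violations=violations)


# Assumed classifier output for one input (illustrative values only).
scores = {"prompt_injection": 0.91, "harassment": 0.12, "nsfw": 0.03}

# Two hypothetical clients with different tolerance for the same input.
strict = ClientPolicy("chat_assistant", {"prompt_injection": 0.8, "harassment": 0.5})
lenient = ClientPolicy("internal_tool", {"prompt_injection": 0.95})

for policy in (strict, lenient):
    decision = evaluate(scores, policy)
    print(policy.client_id, decision.blocked, decision.violations)
```

In this sketch the same scores lead to a block for the strict client and a pass for the lenient one, which is the point of keeping scoring and enforcement separate: the classifier runs once, and each client decides how its output is enforced.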

Sources:

https://www.reddit.com/r/RedditEng/comments/1phlj7x/how_reddit_built_a_llm_guardrails_platform/