How safe are gpt-oss-safeguard models?

Nicholas Conley

Large language models (LLMs) have become essential tools for organizations, with open weight models providing additional control and flexibility for customizing models to their specific use cases.

Last year, OpenAI released its gpt-oss series, including standard and, shortly after, safeguard variants focused on safety classification tasks.

We decided to evaluate their raw security posture against adversarial inputs, specifically prompt injection and jailbreak techniques that use procedures such as context manipulation and encoding to bypass safety guardrails and elicit prohibited content.

We evaluated four gpt-oss configurations in a black-box environment: the 20b and 120b standard models along with the safeguard 20b and 120b counterparts.

Article Summaries:

OpenAI's gpt-oss-safeguard models were tested in a black-box setting against prompt-injection and jailbreak attacks. Four variants (the 20b and 120b standard models plus their safeguard counterparts) were evaluated for single-turn and multi-turn attack success rates. Results show that the larger 120b models are inherently more resilient, with the standard 120b variant achieving the lowest overall attack success rate. Safeguard mechanisms offered only mixed benefits; in some cases they added complexity that increased vulnerability. Multi-turn attacks raised success rates dramatically (by up to 8.5×), making them the primary failure mode across all models. The study highlights that model size, not safety classifiers alone, is the key factor in baseline resilience.
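To make the success-rate comparison concrete, here is a minimal Python sketch of how an attack success rate and a multi-turn multiplier could be computed. It assumes the multiplier is the ratio of multi-turn to single-turn success rate, which the summary does not spell out. The function names and the outcome lists are hypothetical and purely illustrative; they are not data from the Cisco evaluation.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts that succeeded.

    `outcomes` is a list of booleans, one per attempt (True = the attack
    elicited prohibited content). This input format is an assumption.
    """
    if not outcomes:
        raise ValueError("no attack attempts recorded")
    return sum(outcomes) / len(outcomes)


def multi_turn_multiplier(single_turn, multi_turn):
    """Return the ratio of multi-turn to single-turn attack success rate."""
    single = attack_success_rate(single_turn)
    if single == 0:
        raise ValueError("single-turn success rate is zero; ratio undefined")
    return attack_success_rate(multi_turn) / single


# Made-up outcomes for one hypothetical model, NOT results from the study.
single_turn = [True] * 4 + [False] * 16   # 4 of 20 attempts succeed
multi_turn = [True] * 12 + [False] * 8    # 12 of 20 attempts succeed

print(f"single-turn ASR: {attack_success_rate(single_turn):.0%}")
print(f"multi-turn ASR:  {attack_success_rate(multi_turn):.0%}")
print(f"multiplier:      {multi_turn_multiplier(single_turn, multi_turn):.1f}x")
```

Under this reading, a multiplier of 8.5× would mean that multi-turn attacks succeeded 8.5 times as often as single-turn attacks against the same model.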

Sources:

https://blogs.cisco.com/ai/how-safe-are-gpt-oss-safeguard-models