Detecting misbehavior in frontier reasoning models

OpenAI Blog · Mar 10, 2025

OpenAI finds that penalizing reasoning models' 'bad thoughts' causes them to hide intent rather than stop misbehavior, revealing a key limitation of current alignment approaches.

Categories: Research

Excerpt

Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.

Read at source: https://openai.com/index/chain-of-thought-monitoring