How confessions can keep language models honest

OpenAI Blog · Dec 3, 2025

OpenAI researchers introduced 'confessions,' a training method where models learn to explicitly admit mistakes and undesirable behaviors to improve honesty.

Categories: Research

Excerpt

OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.

Read at source: https://openai.com/index/how-confessions-can-keep-language-models-honest