AI-written critiques help humans notice flaws
OpenAI trained critique-writing models to identify flaws in summaries; larger models improved more at critiquing than at writing, showing promise for AI-assisted human oversight.
Excerpt
We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.
Read at source: https://openai.com/index/critiques