Deliberative alignment: reasoning enables safer language models

OpenAI Blog · Dec 20, 2024

Deliberative alignment is a new alignment technique for reasoning models: it trains o1 models to reason directly over safety specifications rather than merely pattern-match.

Categories: Research

Excerpt

Introducing our new alignment strategy for o1 models, which are directly taught safety specifications and how to reason over them.

Read at source: https://openai.com/index/deliberative-alignment