Refusal in Language Models Is Mediated by a Single Direction
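The paper identifies a single direction in LLM residual-stream activations that mediates refusal: removing it suppresses refusal of harmful requests, while adding it induces refusal of harmless ones. This gives alignment work a concrete, tractable target for intervention.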
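A minimal sketch of the idea, assuming the direction is estimated as the difference between mean activations on harmful and harmless prompts and is removed by orthogonal projection. The toy vectors and all function names here are hypothetical; in practice the activations would be read from a model's residual stream, and the intervention would be applied at each layer and token position.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(x, r_hat):
    """Remove the component of x along r_hat: x - (x . r_hat) * r_hat."""
    proj = sum(a * b for a, b in zip(x, r_hat))
    return [a - proj * b for a, b in zip(x, r_hat)]

def add_direction(x, r_hat, alpha):
    """Push x along r_hat by a hypothetical strength alpha."""
    return [a + alpha * b for a, b in zip(x, r_hat)]

# Toy 3-dimensional "activations" (made up for illustration only).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
x_ablated = ablate([5.0, 2.0, -1.0], r_hat)
x_steered = add_direction([0.0, 2.0, -1.0], r_hat, alpha=3.0)
```

After ablation the activation has zero component along `r_hat`, which is the property that would stop the model from representing "refuse" along that direction. Adding the direction back does the opposite.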
HN · 77 points · 29 comments
Read at source: https://arxiv.org/abs/2406.11717