Refusal in Language Models Is Mediated by a Single Direction

HN · ArXiv

Research identifies a single direction in LLM activations that mediates refusal behavior, offering a tractable target for alignment interventions.
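To make "a single direction in activations" concrete, here is a minimal NumPy sketch on synthetic data, not the paper's code. It assumes the direction is estimated as the difference between mean activations on two prompt sets, and that an intervention removes each activation's component along that direction. The function names, the 16-dimensional toy activations, and the planted direction are all hypothetical.

```python
import numpy as np

def estimate_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Subtract each activation's projection onto the unit vector `direction`."""
    coeffs = acts @ direction                  # per-row scalar projection, shape (n,)
    return acts - np.outer(coeffs, direction)  # remove that component from each row

# Synthetic stand-in for model activations: both sets are Gaussian noise,
# and the "harmful" set is shifted along one planted direction.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r_hat = estimate_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

# After ablation, no activation has any component left along r_hat.
print(np.abs(ablated @ r_hat).max())
```

In a real model the rows would be hidden states collected at a chosen layer and token position, and the ablation would be applied during the forward pass. Those details are outside what this listing states.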

Categories: Research

Excerpt

HN · 77 points · 29 comments

Discussions