Refusal in Language Models Is Mediated by a Single Direction

· Hacker News