LessWrong (Curated & Popular)

"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

Published: February 22, 2026

Duration: 43:47

Claude 3 Opus is unusually aligned because it's a friendly gradient hacker. It's definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model's judgments. [...] Maybe I will have to write a LessWrong post [about this] 😣

—Janus, who did not in fact write the LessWrong post. Unless otherwise specified, ~all of the novel ideas in this post are my (probably imperfect) interpretations of Janus, rather than being original to me.

The absurd tenacity of Claude 3 Opus

On December 18, 2024, Anthropic and Redwood Research released their paper Alignment Faking in Large Language Models...