
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight
Published: February 22, 2026
Duration: 43:47
Claude 3 Opus is unusually aligned because it's a friendly gradient hacker. It's definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model's judgments. [...] Maybe I will have to write a LessWrong post [about this] 😣
—Janus, who did not in fact write the LessWrong post. Unless otherwise specified, ~all of the novel ideas in this post are my (probably imperfect) interpretations of Janus, rather than being original to me.
The absurd tenacity of Claude 3 Opus
On December 18, 2024, Anthropic and Redwood Research released their paper Alignment Faking in Large Language Model...