
"Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis
Published: March 21, 2026
Duration: 18:52
(Previously: Prologue.)
Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changing its mind). Unnatural, because we expect the AI to resist having its mind...
Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changing its mind). Unnatural, because we expect the AI to resist having its mind...