How safety updates break AI logic

Published: April 25, 2026

Duration: 18:38

This episode examines the evolution and technical refinement of large language models, focusing on instruction tuning, temporal behavior shifts, and multi-modal integration. One paper explores how training with human feedback aligns models like InstructGPT with user intent, making them more helpful and truthful than their base counterparts. Another study analyzes the internal mechanistic changes caused by this tuning, such as how models come to prioritize instruction verbs and rotate internal knowledge toward specific tasks. However, research on GPT-3.5 and GPT-4 suggests that model performance can drift or degrade over time, particularly on complex reasoning and on following formatting constraints. Finally, the episode turns to multi-modal integration, extending these models beyond text-only input.