
Platform Engineering Podcast
by Cory O'Daniel, CEO of Massdriver
Business, Careers, News, Technology
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver. Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.” Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how pl...
Episodes (40 episodes)
Episode 48
You Need AI Sysadmins Can Trust, With Cribl's Nikhil Mungel
What happens when a non-deterministic AI system is asked to touch production telemetry or generate changes for an SRE pipeline? The cost of being “close enough” can be lost data, downtime, or a security incident. Cribl’s Nikhil Mungel joins Cory to break down what it takes to build AI that sysadmins can actually trust. The conversation digs into harness engineering and the practical guardrails that turn probabilistic models into repeatable, verifiable outcomes. They cover why breaking work into small chunks matters, how validation and testing become the real leverage point for AI-native development, and what “code factorie...
Published: May 13, 2026 | Duration: 55m 16s
Episode 47
Green CI and Merge Queue Mastery with Trunk’s Eli Schleifer
When a flaky test can stall a merge queue, “just rerun CI” stops scaling fast. Cory talks with Trunk co-founder and CEO Eli Schleifer about the outer loop problems that show up as teams ship more code - especially with AI-assisted development increasing PR volume. They break down what a merge queue is, why logical merge conflicts happen even when individual PRs are green, and how predictive testing helps protect main without forcing constant retesting. Eli also explains how Trunk approaches flaky tests: collecting JUnit results, using quarantines so known flakes don’t block delivery, and fi...
Published: Apr 15, 2026 | Duration: 49m 35s
Episode 46
AI-Native Ops: Making AI Safe for Production with William Collins
What happens when your “coworker” can generate code and changes faster than your team can review them, and production still has to stay up? William Collins breaks down what AI-Native Ops looks like when you take reliability seriously: where reasoning should stop, where deterministic automation should begin, and how guardrails like compliance checks, version pinning, and controlled workflows keep AI from turning into outage fuel. Cory and William also dig into why context windows and tool sprawl matter in real systems, how protocols like MCP and agent-to-agent communication are shaping day-to-day automation, and why regulated environments can’t adop...
Published: Apr 1, 2026 | Duration: 1h 3m 0s
Episode 45
Infrastructure as Code's Hidden Problem with Pavlo Baron
Terraform drift, state wrangling, and a growing “tools for tools” stack are still daily work for many platform teams - despite a decade of DevOps talk and cloud maturity. Why does ops automation so often feel like it needs babysitting? Pavlo Baron breaks down where Infrastructure as Code tends to break down in real organizations: manual drift management, low-level state complexity, and a lack of practical abstractions that let developers self-serve without inheriting the entire ops burden. The conversation digs into what a more use-case-driven approach could look like - where teams can choose when to e...
Published: Mar 18, 2026 | Duration: 57m 35s
Episode 44
Why Extend Went All-In on Serverless Platform Engineering
Billions of requests a month on AWS Lambda can cost less than a single engineer’s laptop budget, but only if the architecture and developer workflow are designed for it. Justin Masse, Senior Platform DevOps Engineer at Extend, shares how Extend committed early to a serverless-first approach and built a platform that prioritizes developer speed and low operational toil. The conversation breaks down what it takes to run active-active, multi-region systems in a serverless world, how the team keeps services small and fast, and why asynchronous, event-driven design changes both reliability and cost. You’ll also...
Published: Mar 4, 2026 | Duration: 1h 2m 28s
Episode 43
Observability in the AI Era with New Relic's Nic Benders
What happens when nobody wrote the code running in your production environment? As AI-generated software becomes standard practice, platform engineers face a new challenge: operating systems without experts to consult. Nic Benders, Chief Technical Strategist at New Relic, has spent 15 years watching observability evolve from basic server monitoring to understanding complex distributed systems. Now he's tackling the next frontier: how to maintain and operate software when there's no human author to ask why something was built a certain way. The conversation covers the shift from instrumentation being the hard problem to understanding being the bottleneck...
Published: Feb 18, 2026 | Duration: 50m 37s
Episode 42
Simplicity at Scale: Cleaning House for Platform Teams with Brian Childress
Why do so many “modern” platforms feel slow, fragile, and painful to work on? Platform engineer and fractional CTO Brian Childress joins Cory to discuss how over-engineering, resume‑driven development, and scattered tooling quietly block teams from shipping value. They explore why simplicity is a competitive advantage for platform teams, especially as AI becomes part of everyday development. You’ll learn: How to design a simple platform MVP that developers actually like using; What a good local‑to‑prod story looks like (and why it’s the real scaling superpower); Practical ways to onboard humans and AI tools s...
Published: Dec 17, 2025 | Duration: 40m 46s
Episode 41
Using Feature Flags to Tame Complexity with Mike Zorn
What if changing a single flag could save you from a failed migration, a broken API, or a late-night rollback? Join us as we dive into how feature flags become a practical tool for changing application behavior at runtime, not just toggling UI elements. Cory talks with Mike Zorn about real stories from LaunchDarkly and Rippling, covering how teams use flags to ship safely, debug faster, and simplify complex systems. You’ll hear about: Using feature flags to avoid staging overload and ship directly to production; Migrating critical systems and databases with minimal downtime and risk; Controlling lo...
Published: Dec 3, 2025 | Duration: 43m 29s
Episode 40
Policy as Code: Kyverno and Securing Kubernetes at Scale with Jim Bugwadia
Most Kubernetes security breaches don't come from zero-day exploits - they come from misconfigurations. While your team runs scanners and reviews reports, containers are already running as root, network policies are missing, and compliance violations are piling up across dozens of repositories. Jim Bugwadia, co-founder and CEO of Nirmata and creator of Kyverno, joins Cory to talk about a different approach: policy as code. Instead of asking developers to remember security best practices across every repo, what if your cluster automatically enforced secure defaults and blocked non-compliant deployments before they ever reached production? You'll learn...
Published: Nov 19, 2025 | Duration: 42m 21s
Episode 39
Guest Host: Kelsey Hightower - Beyond Pipelines: Infrastructure As Data
Is your Git repo really the source of truth for infrastructure - or just a suggestion? Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack why many teams hit dead ends with CI/CD for provisioning, where GitOps struggles with drift, and when TicketOps helps or hurts. They explore a different model: infrastructure as data with typed contracts, shared artifacts, and workflows that embed policy, validation, and upgrades from the start. You’ll hear practical ways to reduce cognitive load for developers while giving operations reliable control and better day‑2 levers. You’ll learn...
Published: Nov 5, 2025 | Duration: 48m 51s
Episode 38
Guest Host: Kelsey Hightower - Are CI/CD and GitOps Just Making Things Harder?
What if your production environment had a live, trustworthy blueprint you could zoom in and out of on demand? Kelsey Hightower guest-hosts a candid conversation with Cory about why CI/CD pipelines and GitOps often break down for cloud infrastructure. They explore a simpler operational model: treat infrastructure as data, lean on clear checkpoints instead of rigid “golden paths,” and make production legible for both developers and ops. You’ll learn: Where CI/CD adds friction for infra and what to do instead; Why GitOps works for apps but hits limits for databases, networks, and multi...
Published: Oct 22, 2025 | Duration: 30m 18s
Episode 37
Guest Host: Kelsey Hightower — Why IaC Alone Isn’t Enough
Ever wonder why strong Terraform modules still lead to long review queues and fragile pipelines? From hand-built scripts and early data center migrations to cloud sprawl and Kubernetes, configuration management has changed a lot - but the core struggle remains: too many decisions, not enough guardrails. Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack where Infrastructure as Code succeeds and where teams get stuck. What you’ll learn: How to avoid “choice overload” in cloud configs by moving decisions upstream; Practical ways to pair IaC with UX, policies, and SLAs to reduce toil; When click-op...
Published: Oct 8, 2025 | Duration: 39m 40s
Episode 36
How to Ship Faster with Feature Flags: Insights from Unleash
Still freezing code before Black Friday and hoping nothing breaks? Feature flags can help you ship smaller, safer changes continuously, without the “big bang” risk or painful rollbacks. Cory O’Daniel talks with Unleash VP of Marketing Michael Ferranti about how modern teams use flags as a core delivery primitive alongside CI/CD and trunk-based development. They dig into kill switches for instant mitigation, progressive rollouts tied to real metrics, and why homegrown “if-statement” systems turn into hidden platforms you didn’t mean to build. They also cover the rising volume of AI‑assisted code and how flags provide th...
Published: Sep 24, 2025 | Duration: 43m 58s
Episode 35
GraphQL, MCP, and the Future of APIs with Apollo CEO Matt DeBergalis
**UPDATE** - Apollo GraphQL has kindly offered us a few free passes to join them at the GraphQL Summit in San Francisco, October 6-8, 2025. If you are interested in going, the code is: PodcastSummit25. What if your API layer could help you ship faster today and make tomorrow’s AI workflows safer and easier to build? Apollo CEO Matt DeBergalis explains how GraphQL became a practical standard for unifying messy backends, why declarative schemas and strong types are the “bedrock” for agentic systems, and where MCP fits when you want agents to call business data safely...
Published: Sep 10, 2025 | Duration: 43m 7s
Episode 34
Beyond Cracking the Coding Interview with Mike Mroczka
Ever wondered how many “perfect” candidates simply learned the test, or how many great engineers get filtered out by bad interview design? Mike Mroczka, interview coach and ex-Googler, shares what really goes on behind technical hiring and how to navigate it to your advantage. What you’ll learn: How leaked question banks and standardized puzzles can distort hiring signals - and where they still help; Practical ways companies can make interviews fairer and harder to game, both on-site and remote; A balanced take on data structures and algorithms: when they’re useful and when they’re noise; Tactics to spot and red...
Published: Aug 20, 2025 | Duration: 1h 8m 35s
Episode 33
From React to Dagster: Pete Hunt on Data, Infra, and AI-Ready Platforms
Is Postgres actually a better message queue than Kafka? This provocative question is just one of many insights Pete Hunt shares in this conversation about data orchestration, platform engineering, and the evolution of infrastructure. Pete Hunt, CEO of Dagster Labs and former React co-founder at Facebook, brings his unique perspective from working at tech giants like Instagram and Twitter to discuss how different platform team approaches impact product development. Having witnessed both Facebook's clear delineation between product and infrastructure teams and Twitter's DevOps-style ownership model, Pete offers valuable comparisons of these contrasting philosophies. The conversation...
Published: Jul 30, 2025 | Duration: 49m 32s
Episode 32
Building Better Platforms with Dapr: Abstractions, Portability, and Durable Systems with Mark Fussell
Cloud lock-in isn't just about where your data lives - it's about how deeply cloud-specific code permeates your applications. Mark Fussell, co-creator of Dapr and CEO of Diagrid, joins Cory O'Daniel to explore how Dapr provides clean abstractions for common distributed system patterns, enabling teams to build portable applications without sacrificing cloud-native capabilities. The conversation covers: How Dapr creates a clean separation between application code and underlying infrastructure services like messaging, state management, and secrets; Why platform teams struggle with tight coupling between applications and infrastructure, and how Dapr solves this problem; The benefits of Dapr's sidecar architecture for lo...
Published: Jul 16, 2025 | Duration: 48m 39s
Episode 31
What CVEs Did for Security, CREs Are Doing for Reliability
Did you know that software engineers often "learn things the hard way" because they lack a standardized system to share knowledge about reliability issues? While security professionals have CVEs to catalog vulnerabilities, reliability engineers have been left to reinvent the wheel with each new bug or outage. Tony Meehan, co-founder and CTO of Prequel, introduces us to Common Reliability Enumerations (CREs) - an open-source approach that's doing for reliability what CVEs did for security. After spending a decade at the NSA hunting vulnerabilities, Tony recognized that the same community-driven approach could revolutionize how we handle reliability issues...
Published: Jul 2, 2025 | Duration: 47m 36s
Episode 30
From DevOps to 'Vibe Coding': Gene Kim on AI-Assisted Development and Platform Engineering
What if you could turn a five-year software project into a one-month endeavor? Gene Kim, co-founder of IT Revolution and author of The Phoenix Project, reveals how AI-powered Vibe Coding is transforming the way developers work. Kim shares insights from his upcoming book about how developers are achieving unprecedented productivity, including how his co-author produces 12,000 lines of production-ready code daily using AI assistance. But it's not just about speed - learn how this approach enables developers to tackle previously impossible projects and explore larger design spaces. From DevOps evolution to practical AI implementation, Kim discusses:...
Published: May 28, 2025 | Duration: 56m 32s
Episode 29
Snyk’s Danny Allan on Making Security Developer-Friendly
Security often feels like a roadblock to developers, but what if it could be seamlessly integrated into the development process? As software delivery becomes increasingly automated and self-service, the traditional approach to security needs a major overhaul. Danny Allan, CTO at Snyk, shares practical insights on transforming security from a bottleneck into an enabler of developer productivity. Drawing from his extensive experience at IBM, VMware, and Veeam, Allan discusses how security teams can shift left effectively without creating friction. Key topics covered: Building successful security champions programs that cultivate curiosity rather than relying solely...
Published: Apr 30, 2025 | Duration: 45m 26s