Why Azure Says Stop Blaming People for Outages

TL;DR: A post-mortem of Azure's 2023 global outage reveals a crucial lesson: "human error" is a myth. Engineering leaders should instead focus on fixing systemic flaws to build truly resilient systems and protect their teams from blame.
Key facts
- Category: Infrastructure
- Impact: High
- Source: InfoQ
Full summary
A deep dive into Azure's 2023 global outage reveals why blaming "human error" is a dangerous myth for modern engineering teams.
In an analysis of Azure's 2023 global WAN outage, speaker Sean Klein argues that "human error" is a dangerous myth in complex systems. While it’s easy to blame an individual for a mistake, this approach often hides deeper, systemic problems. Traditional incident reviews, like the "Five Whys" method, can prematurely stop at a person's action instead of asking why the system allowed that action to have catastrophic consequences. The presentation highlights that in modern, large-scale infrastructure, a single action should rarely be capable of causing a global failure. The real cause is often found in flawed processes, inadequate safeguards, or gaps in tooling that created the conditions for the error to occur and escalate.
This shift in perspective is critical for engineering leaders, CTOs, and IT teams. Instead of focusing on blame, the goal should be to build more resilient systems that protect engineers from making critical mistakes. This means designing infrastructure that anticipates errors and has automated checks and balances to prevent them from cascading. It involves improving Standard Operating Procedures (SOPs) to be clearer and more effective, and providing engineers with better tools and training. By treating every incident as a system failure, not a personal one, organizations can foster a culture of psychological safety. This encourages open communication and allows teams to learn from mistakes without fear, leading to more robust and reliable services.
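The idea of automated checks that keep a single action from cascading can be illustrated with a small pre-flight guard. This is a minimal sketch under stated assumptions, not a description of Azure's actual tooling: `ChangeRequest`, `check_blast_radius`, `BlastRadiusExceeded`, the device names, and the 5% threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical description of an operational change (illustrative only)."""
    command: str
    target_devices: FrozenSet[str]


class BlastRadiusExceeded(Exception):
    """Raised when a change would touch more of the fleet than policy allows."""


def check_blast_radius(change: ChangeRequest, fleet_size: int,
                       max_fraction: float = 0.05) -> float:
    """Return the fraction of the fleet a change touches, or raise if too large.

    The 5% default is an assumed policy value, not a figure from the talk.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    fraction = len(change.target_devices) / fleet_size
    if fraction > max_fraction:
        # The system, not the operator, refuses the risky action.
        raise BlastRadiusExceeded(
            f"change targets {fraction:.0%} of the fleet; "
            f"limit is {max_fraction:.0%}"
        )
    return fraction


if __name__ == "__main__":
    fleet_size = 1000
    wide_change = ChangeRequest(
        command="hypothetical-maintenance-command",
        target_devices=frozenset(f"router-{i}" for i in range(500)),
    )
    try:
        check_blast_radius(wide_change, fleet_size)
    except BlastRadiusExceeded as exc:
        print(f"Blocked before execution: {exc}")
```

The design choice is that the limit lives in the tooling rather than in the operator's memory: a mistyped or overly broad command is rejected before it runs, so the review afterward asks why the guard was missing or too loose, not who typed the command.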
The lessons from Azure’s outage provide a valuable framework for any company running critical infrastructure. Moving beyond blame allows for a more honest and effective post-mortem process, uncovering vulnerabilities that would otherwise remain hidden. This modern approach to incident analysis helps organizations understand the complex interactions between people, processes, and technology. Ultimately, by focusing on systemic resilience, businesses can reduce the frequency and impact of future outages, improve operational stability, and create a more sustainable and supportive engineering environment.
Why it matters
Blaming "human error" for outages masks deeper systemic flaws and creates a culture of fear. This analysis shows leaders how to shift focus to building resilient systems, which improves reliability and supports engineering teams.
Business impact
Adopting a "systems-first" approach to incident analysis can reduce the frequency and severity of costly outages. It also improves engineering culture: a psychologically safe environment in which teams can operate and learn can help with talent retention and innovation.
Primary source: InfoQ