The Psychological Safety Imperative
When things break - and they will break - the natural human instinct is to find "The One Who Did It." We want a name. We want a face. We want to fire "John" because John deleted the production database.
This instinct is toxic. It is anti-scientific. It is the enemy of reliability.
If you fire John - you have solved nothing. You have removed one agent from the system - but you have left the Systemic Flaw intact. Why was it possible for John to delete the database? Why did he have root access? Why was there no "Soft Delete" protocol? Why was the restore process not tested?
By punishing John - you send a signal to the rest of the team: "Hide your mistakes." "Do not take risks." "Do not touch the database." You destroy Psychological Safety. Without safety - information flow stops. Engineers stop reporting "Near Misses." They stop asking for help. The system becomes opaque.
The question "How fast can they find the root cause?" presumes that they are willing to look for it. If they are scared - they will look for an alibi instead. Sidney Dekker, in The Field Guide to Understanding 'Human Error', puts it succinctly:
"You can't punish people and learn at the same time. The two are mutually exclusive. If you punish, you shut down the flow of information that you need to learn." — Sidney Dekker
The Swiss Cheese Model
We adhere to James Reason's Swiss Cheese Model of accident causation. In complex systems - catastrophic failure is rarely caused by a single error. It is caused by the alignment of multiple, smaller failures across different layers of defense.
Imagine slices of Swiss cheese lined up. Each slice is a defense layer.
- Layer 1: Code Review. (Hole: The reviewer was tired and missed the bug).
- Layer 2: CI Pipeline. (Hole: The unit tests didn't cover the edge case).
- Layer 3: Staging Environment. (Hole: Staging data didn't match Production data volume).
- Layer 4: Permissions Architecture. (Hole: The deployment script ran as root).
The accident happens only when the holes align perfectly - allowing the hazard to pass through all layers. Blaming the engineer at the end of the chain ignores the failure of every defense layer that should have caught the error first.
Our Blameless Retrospectives focus on identifying these holes. We ask "How" and "Why" - never "Who." We treat the error as a symptom of a fragile system. We patch the holes. We add new slices of cheese.
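To put rough numbers on the "holes aligning" idea, here is a minimal sketch. It assumes each layer misses a hazard independently of the others, which real systems rarely guarantee, and every miss rate below is invented purely for illustration. The layer names mirror the list above; the fifth layer is hypothetical.

```python
from math import prod

# Hypothetical probability that each layer FAILS to stop a given hazard.
# These numbers are illustrative assumptions, not measurements.
layers = {
    "code_review": 0.10,
    "ci_pipeline": 0.20,
    "staging": 0.30,
    "permissions": 0.50,
}


def breach_probability(miss_rates):
    """Chance a hazard slips through every layer, assuming independent layers."""
    return prod(miss_rates)


baseline = breach_probability(layers.values())

# "Add a new slice of cheese": a hypothetical fifth safeguard that misses 10% of hazards.
with_extra = breach_probability([*layers.values(), 0.10])

print(f"four layers: {baseline:.4f}")   # four layers: 0.0030
print(f"five layers: {with_extra:.4f}")  # five layers: 0.0003
```

Under these assumptions, no single layer is close to perfect, yet the stack as a whole is far stronger than any one slice. That is why the retrospective looks for holes in every layer instead of stopping at the last person who touched the system.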
The Counterfactual Check
To enforce rigor - we use the Counterfactual Check. We ask: "If we replaced John with the best engineer in the world - would this accident still have happened?"
If the answer is "Yes" (because the UI was confusing - or the API was undocumented) - then the engineer is innocent. The system is guilty.
This approach is critical for QA & Security teams. Security is not about "Good People" vs "Bad People." It is about "Robust Systems" vs "Vulnerable Systems." A phishing attack works not because the employee is stupid - but because the email filter failed and the auth system lacked 2FA.
John Allspaw, writing about Etsy's engineering culture, reinforces this view of error as a signal:
"An incident is an unplanned investment. If you don't learn from it, you've wasted the investment." — John Allspaw
Retrospective as Product Feature
We view the Post-Incident Review (PIR) document as a product feature. It is a deliverable. It must be written. It must be shared. It must contain:
- Timeline: A second-by-second account of the failure.
- Root Cause Analysis: The technical physics of the break.
- Corrective Actions: Specific JIRA tickets to fix the holes.
- Learnings: What did we learn about our system that we didn't know before?
This turns failure into an asset. The organization gets smarter with every crash. The "Knowledge Base" grows. The "Mental Model" of the team aligns with reality.
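One way to treat the PIR as a deliverable is to give it a fixed shape and refuse to publish it until every section is filled in. The sketch below is an assumption about how that could look, not a description of any existing tool: the class name, field names, and the sample ticket ID are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PostIncidentReview:
    """Hypothetical container for the four required PIR sections."""
    title: str
    timeline: list[str] = field(default_factory=list)
    root_cause_analysis: str = ""
    corrective_actions: list[str] = field(default_factory=list)  # ticket IDs
    learnings: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "timeline": self.timeline,
            "root_cause_analysis": self.root_cause_analysis,
            "corrective_actions": self.corrective_actions,
            "learnings": self.learnings,
        }
        return [name for name, value in required.items() if not value]

    def is_publishable(self) -> bool:
        """A PIR is ready to share only when no section is empty."""
        return not self.missing_sections()


draft = PostIncidentReview(
    title="Production database deleted",
    timeline=["Deployment script started", "Database dropped"],
)
print(draft.missing_sections())
# ['root_cause_analysis', 'corrective_actions', 'learnings']

draft.root_cause_analysis = "Deployment script ran as root with no soft delete."
draft.corrective_actions = ["OPS-0000"]  # placeholder ticket ID
draft.learnings = ["The restore process had never been tested."]
print(draft.is_publishable())  # True
```

Note that the structure has no field for "who did it". The omission is deliberate: the template itself nudges the author toward "How" and "Why".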
Nancy Leveson, in Engineering a Safer World, argues against the simplicity of linear causality:
"Accidents are not the result of individual component failures, but the result of the interactions between components... We must treat safety as a control problem, not a reliability problem." — Nancy Leveson
This is how you build high-fidelity teams. You don't fire them for making mistakes. You teach them to study mistakes. You convert "Chaos" into "Curriculum."