Running Post-Mortems on Disaster Scenarios with AI
Prerequisites
- Familiarity with incident response or post-mortem processes
- Basic understanding of Claude Code and MCP concepts
Zalando's engineering team had years of post-mortem reports sitting in a wiki. According to their 2025 engineering blog post, the traditional approach to incident analysis worked for immediate reactive learning but "does not work well for retrospective analysis of years of past incident reports at company scale." When they finally pointed an LLM at the archive, patterns emerged: recurring failure modes across their datastores, and what the team described as investment opportunities that had been invisible to individual reviewers.
Many organizations treat post-mortems the same way. Write the report. File it. Move on to the next fire. The investigation itself takes hours of manual reconstruction from scattered logs, Slack threads, and ticket histories. By the time the document lands, the team has already context-switched to the next sprint.
AI doesn't just speed this up. It changes what's possible.
Why Post-Mortems Fail Without AI
The pattern is consistent across industries. Post-mortem documentation happens after containment, when the team is exhausted and already thinking about the next priority. PagerDuty's 2025 analysis of security post-mortems found that "documenting what happened takes a back seat to containment and recovery, leaving analysis reliant on memory, scattered notes, and competing narratives."
Three problems compound:
Scattered evidence. Incident data lives across monitoring dashboards, Slack channels, deployment logs, ticket systems, and email threads. Reconstructing a timeline means manually cross-referencing timestamps across half a dozen tools. The person writing the report may not have been in every channel where decisions were made.
No cross-incident memory. Each post-mortem exists in isolation. When a team writes their third database failover report in 18 months, nobody connects it to the first two. The systemic pattern stays invisible because no human has the bandwidth to read and correlate hundreds of past reports.
Blame gravity. Without complete data, post-mortems default to narrative. Narratives have protagonists. Even in organizations that espouse blameless culture, incomplete evidence leaves room for subjective interpretation. The person who pushed the deploy gets more scrutiny than the missing guardrail that allowed the bad deploy to proceed.
Five Ways AI Transforms Disaster Investigations
1. Real-time capture
The biggest gap in traditional post-mortems is documentation. PagerDuty built their Scribe Agent specifically to address this: it's designed to capture conversations, system alerts, and meeting notes in real time during incidents. No one stops to take notes during a production outage. AI does it for them, tagging timestamps, identifying decision points, and capturing the raw material that would otherwise evaporate within hours.
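The core of real-time capture is simple: every message gets a timestamp, a source, and a flag for whether it looks like a decision point. Here's a minimal sketch of that tagging step; the marker phrases and event shape are illustrative assumptions, not any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative marker phrases that suggest a decision point in chat traffic.
DECISION_MARKERS = ("rolling back", "failing over", "escalating", "declaring")

@dataclass
class CapturedEvent:
    timestamp: datetime
    source: str          # e.g. "slack", "alerts", "meeting-notes"
    text: str
    is_decision: bool = False

def capture(timestamp: datetime, source: str, text: str) -> CapturedEvent:
    """Record an incoming message, flagging likely decision points."""
    lowered = text.lower()
    return CapturedEvent(
        timestamp=timestamp,
        source=source,
        text=text,
        is_decision=any(marker in lowered for marker in DECISION_MARKERS),
    )

log = [
    capture(datetime(2025, 3, 1, 2, 14), "slack", "Latency spiking on checkout"),
    capture(datetime(2025, 3, 1, 2, 21), "slack", "Rolling back deploy 4812"),
]
decisions = [e for e in log if e.is_decision]
```

In practice an LLM does the classification rather than keyword matching, but the output shape, timestamped events with decision points flagged, is the raw material the rest of the investigation builds on.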
2. Automated timeline reconstruction
Instead of one engineer spending half a day cross-referencing Slack timestamps with deployment logs and monitoring alerts, AI can ingest data from every connected system and assemble the timeline automatically. SolarWinds' 2025 data puts average incident resolution overhead at nearly 5 hours. AI-assisted reconstruction cuts that dramatically. And the timeline tends to be more complete, because AI doesn't forget to check the email thread where the on-call engineer escalated at 2 AM.
3. Cross-incident pattern detection
This is where Zalando found gold. Individual post-mortems describe individual incidents. AI reads hundreds of them and finds the recurring themes: the same database hitting connection limits every quarter, the same microservice failing under identical load patterns, the same deployment step causing rollbacks. These systemic signals are invisible when each report is read in isolation.
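The shape of cross-incident detection is counting recurring failure modes across a report corpus. This sketch uses crude regex themes on invented report summaries to show the idea; a real pipeline would use an LLM or embeddings to cluster themes rather than hand-written patterns.

```python
from collections import Counter
import re

# Hypothetical corpus of past post-mortem summaries.
reports = [
    "orders-db hit its connection limit during the quarterly sale",
    "payment-svc timed out under peak load; rollback required",
    "orders-db connection pool exhausted again during a traffic spike",
]

# Illustrative failure-mode themes, expressed as regex patterns.
THEMES = {
    "connection exhaustion": r"connection (limit|pool exhausted)",
    "timeout under load": r"timed out under",
}

theme_counts = Counter(
    theme
    for report in reports
    for theme, pattern in THEMES.items()
    if re.search(pattern, report)
)
print(theme_counts.most_common())
```

Once a theme recurs across independently written reports, it stops being an incident detail and becomes a systemic signal worth an investment conversation.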
4. Causal analysis and blame reduction
AI traces contributing factors as a chain of system interactions, not a sequence of human decisions. A bad deploy didn't cause the outage. A missing canary check allowed the bad deploy. A capacity gap meant the canary wouldn't have caught it anyway. A monitoring blind spot meant the capacity gap was unknown. When the causal chain is data-driven, the conversation shifts from "who did it" to "what allowed it."
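One way to make that shift concrete is to represent the incident as a chain of typed contributing factors rather than a single cause. A sketch, with descriptions mirroring the example above (all names illustrative):

```python
from dataclasses import dataclass

@dataclass
class Factor:
    description: str
    kind: str  # e.g. "observability", "capacity", "guardrail", "change"

# The outage as a chain of system conditions, not one human error.
chain = [
    Factor("Monitoring blind spot hid the capacity gap", "observability"),
    Factor("Capacity gap meant a canary would not have caught it", "capacity"),
    Factor("Missing canary check allowed the deploy to proceed", "guardrail"),
    Factor("Deploy shipped a regression", "change"),
]

# The report leads with what allowed the failure, not who triggered it.
systemic = [factor for factor in chain if factor.kind != "change"]
```

When the report structure itself demands a chain, a single "bad deploy" entry looks obviously incomplete, which nudges the analysis toward the guardrails.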
5. Predictive insights
Historical post-mortem data becomes a dataset for forecasting. If three of your last five incidents involved a specific service during peak traffic windows, that's not a coincidence. It's a signal. AI surfaces these patterns as risk indicators, turning reactive post-mortems into proactive investment decisions.
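The simplest version of this signal is a frequency threshold over structured incident history. A sketch with invented data; the service names and threshold are assumptions:

```python
from collections import Counter

# Hypothetical incident history: (service, occurred_during_peak_traffic).
history = [
    ("checkout", True),
    ("checkout", True),
    ("search", False),
    ("checkout", True),
    ("inventory", False),
]

def peak_risk_signals(incidents, threshold=3):
    """Flag services that repeatedly fail during peak-traffic windows."""
    peak_counts = Counter(svc for svc, peak in incidents if peak)
    return [svc for svc, n in peak_counts.items() if n >= threshold]

print(peak_risk_signals(history))
```

Even this naive count turns the archive into a forward-looking artifact: the flagged service becomes a capacity-planning line item before the next peak window, not after it.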
From IT Outages to Supply Chain Failures
The same investigation framework applies beyond infrastructure. Any scenario where you need to reconstruct what happened, identify root causes, and prevent recurrence is a post-mortem scenario.
The common thread across these scenarios: multi-source data synthesis and timeline reconstruction. Whether you're analyzing a Kubernetes pod crash or a warehouse flooding event, the investigative process follows the same shape. You gather evidence from disparate sources, stitch together a chronological narrative, then figure out what caused it and what to change so it doesn't happen again.
AI handles the first two steps faster and more completely than any human team. What changes across domains is the data sources, not the analysis pattern.
Building an AI Post-Mortem Workflow with Claude
Here's where theory meets implementation. A Claude-based post-mortem workflow connects to your existing tools through Model Context Protocol (MCP) servers, giving the AI direct access to the raw data it needs to investigate.
The setup is practical. If your team already uses MCP servers to connect Claude to Jira, GitHub, and Confluence, you're halfway there. Add connections to your monitoring and logging tools, and Claude can pull incident data from every source in a single investigation session.
Custom Claude Code skills turn your post-mortem checklist into a repeatable investigation protocol. A post-mortem-investigation skill can encode your team's specific requirements: which systems to check, what questions to answer, how to structure the output, and which stakeholders need what level of detail.
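Whatever format your skill takes, the underlying protocol is just structured data a workflow can walk: systems to check, questions to answer, outputs per audience. A sketch of that encoding (all names and audiences are illustrative, not a Claude Code skill format):

```python
# A post-mortem checklist encoded as data an investigation workflow can walk.
PROTOCOL = {
    "systems": ["monitoring", "deploy log", "chat", "tickets"],
    "questions": [
        "When did the first alert fire?",
        "What changed in the hour before?",
        "Who was paged, and when did they acknowledge?",
    ],
    "outputs": {
        "engineering": "full timeline + causal chain",
        "leadership": "impact summary + investment asks",
    },
}

def render_brief(protocol, audience):
    """Return the level of detail owed to a given stakeholder group."""
    return protocol["outputs"].get(audience, "impact summary")
```

Encoding the checklist as data rather than tribal knowledge is what makes the investigation repeatable across incidents and on-call rotations.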
Start with a single incident type. Pick your most frequent failure mode (deployment rollback, database connection exhaustion, API timeout cascade) and build the investigation workflow for that one scenario. Expand to other incident types once the pattern works.
With the right MCP connections in place, Claude doesn't just summarize. It can cross-reference the deploy timestamp against the monitoring alert, pull the Slack thread where someone noticed latency, then check Git for what changed in that window. The investigation follows the evidence, not someone's recollection of the evidence.
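The deploy-versus-alert cross-reference is a windowed lookup once the data is pulled. A sketch of that step, with hypothetical records standing in for what Claude would fetch over MCP:

```python
from datetime import datetime, timedelta

# Hypothetical pulled records; in practice fetched via MCP connections.
alert_time = datetime(2025, 3, 1, 2, 7)
deploys = [
    ("v4810", datetime(2025, 2, 28, 16, 30)),
    ("v4812", datetime(2025, 3, 1, 2, 2)),
]

def suspects(deploys, alert_time, window=timedelta(hours=1)):
    """Deploys that landed shortly before the alert are prime suspects."""
    return [tag for tag, ts in deploys if alert_time - window <= ts <= alert_time]

print(suspects(deploys, alert_time))
```

From there the investigation branches: pull the diff for the suspect deploy, check the chat thread from the same window, and widen the search window only if nothing correlates.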
What Stays Human
AI handles data synthesis. Humans own judgment.
No tool creates blameless culture. That takes leadership and trust built over dozens of retros where nobody got punished for honesty. But AI makes blameless post-mortems easier to practice by removing the conditions that breed blame in the first place.
When the timeline is reconstructed from system data rather than personal accounts, there's less room for "who did what" narratives. A causal chain showing five contributing factors instead of one bad deploy broadens the conversation on its own. And if the analysis links the current incident to three prior ones? Nobody's talking about individual error anymore. They're talking about where to invest.
Zalando's team reached the same conclusion: "Human curation remains crucial for accuracy, fostering trust, and addressing limitations like hallucinations and surface attribution errors." AI is the investigator. Humans are the judges.
AI-generated post-mortem analyses should always be reviewed by the incident team before distribution. Treat them as a first draft assembled from evidence, not as a final verdict.
Turning Incident Data Into Strategic Advantage
The shift from manual post-mortems to AI-assisted investigations isn't about saving time on a single report. It's about building organizational memory that compounds.
Each report feeds the pattern detection that flags the next risk before it pages anyone. Reports stop collecting dust in Confluence and start working as infrastructure.
Zalando's team called their old reports "dead ends." After pointing AI at the archive, they found investment opportunities and systemic patterns that justified infrastructure changes. That transformation is available to any team willing to connect their incident data to an AI investigation workflow.
If your team runs post-mortems today but struggles to act on them, I can help you build the workflow. From MCP server configuration to custom investigation skills, the pieces exist. The question is simple: are your post-mortem reports doing anything, or just taking up space in a wiki nobody reads?
Want to talk about how this applies to your team?
Book a Discovery Call

Not ready for a call? Grab the Claude Adoption Checklist instead.