
Running Post-Mortems on Disaster Scenarios with AI

March 29, 2026 · 12 min read · Mitchel Lairscey
Before you start
  • Familiarity with incident response or post-mortem processes
  • Basic understanding of Claude Code and MCP concepts

Zalando's engineering team had years of postmortem reports sitting in a wiki. According to their 2025 engineering blog post, the traditional approach to incident analysis worked for immediate reactive learning but "does not work well for retrospective analysis of years of past incident reports at company scale." When they finally pointed an LLM at the archive, patterns emerged: recurring failure modes across their datastores, and what the team described as investment opportunities that had been invisible to individual reviewers.

Many organizations treat post-mortems the same way. Write the report. File it. Move on to the next fire. The investigation itself takes hours of manual reconstruction from scattered logs, Slack threads, and ticket histories. By the time the document lands, the team has already context-switched to the next sprint.

AI doesn't just speed this up. It changes what's possible.

Why Post-Mortems Fail Without AI

  • 47% of tech leaders never conduct post-incident reviews (Infrascale, 2025)
  • ~5 hrs average resolution overhead per incident (SolarWinds, 2025)
  • 30% of organizations regularly test their IR plans (IBM / Ponemon)

The pattern is consistent across industries. Post-mortem documentation happens after containment, when the team is exhausted and already thinking about the next priority. PagerDuty's 2025 analysis of security post-mortems found that "documenting what happened takes a back seat to containment and recovery, leaving analysis reliant on memory, scattered notes, and competing narratives."

Three problems compound:

Scattered evidence. Incident data lives across monitoring dashboards, Slack channels, deployment logs, ticket systems, and email threads. Reconstructing a timeline means manually cross-referencing timestamps across half a dozen tools. The person writing the report may not have been in every channel where decisions were made.

No cross-incident memory. Each post-mortem exists in isolation. When a team writes their third database failover report in 18 months, nobody connects it to the first two. The systemic pattern stays invisible because no human has the bandwidth to read and correlate hundreds of past reports.

Blame gravity. Without complete data, post-mortems default to narrative. Narratives have protagonists. Even in organizations that espouse blameless culture, incomplete evidence leaves room for subjective interpretation. The person who pushed the deploy gets more scrutiny than the missing guardrail that allowed the bad deploy to proceed.

[Figure: The traditional post-mortem gap. Incident data scatters across Slack, Datadog, Jira, Git, PagerDuty, and email; manual reconstruction loses fidelity at each step, and by the time the report is complete, critical context has evaporated and the document joins the archive.]

Five Ways AI Transforms Disaster Investigations

[Figure: The AI post-mortem pipeline, a continuous loop of five phases: (1) Capture, real-time data ingestion during the incident; (2) Reconstruct, automated timeline from logs, chat, and deployments; (3) Detect, cross-incident pattern matching across history; (4) Analyze, causal chain mapping with blame reduction; (5) Predict, systemic risk signals from historical data. Phases 1–2 run during and after the incident; phases 3–5 are retrospective analysis across the entire incident history.]

1. Real-time capture

The biggest gap in traditional post-mortems is documentation. PagerDuty built their Scribe Agent specifically to address this: it's designed to capture conversations, system alerts, and meeting notes in real time during incidents. No one stops to take notes during a production outage. AI does it for them, tagging timestamps, identifying decision points, and capturing the raw material that would otherwise evaporate within hours.

2. Automated timeline reconstruction

Instead of one engineer spending half a day cross-referencing Slack timestamps with deployment logs and monitoring alerts, AI can ingest data from every connected system and assemble the timeline automatically. SolarWinds' 2025 data puts average incident resolution overhead at nearly 5 hours. AI-assisted reconstruction cuts that dramatically. And the timeline tends to be more complete, because AI doesn't forget to check the email thread where the on-call engineer escalated at 2 AM.
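The mechanics of timeline reconstruction are simple once the data is in one place. Here's a minimal sketch in Python: the event records and their fields are hypothetical stand-ins for what you would pull from monitoring, chat, and deployment APIs (via MCP or otherwise); the only real work is normalizing timestamps and sorting.

```python
from datetime import datetime

# Hypothetical events from three sources; in practice these would come
# from your monitoring, chat, and deployment systems via MCP connectors.
events = [
    {"source": "deploy", "ts": "2026-03-01T01:47:00Z", "note": "payments-api v2.3.1 rolled out"},
    {"source": "slack", "ts": "2026-03-01T02:03:00Z", "note": "on-call reports elevated p99 latency"},
    {"source": "alerts", "ts": "2026-03-01T01:58:00Z", "note": "error rate > 5% on payments-api"},
]

def build_timeline(events):
    """Normalize ISO-8601 timestamps and sort events into one chronology."""
    def parse(e):
        return datetime.fromisoformat(e["ts"].replace("Z", "+00:00"))
    return sorted(events, key=parse)

for e in build_timeline(events):
    print(f'{e["ts"]}  [{e["source"]:>7}]  {e["note"]}')
```

The sorting is trivial; the value comes from the AI doing the collection step that precedes it, across every source at once.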

3. Cross-incident pattern detection

This is where Zalando found gold. Individual post-mortems describe individual incidents. AI reads hundreds of them and finds the recurring themes: the same database hitting connection limits every quarter, the same microservice failing under identical load patterns, the same deployment step causing rollbacks. These systemic signals are invisible when each report is read in isolation.
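A rough intuition for how this surfacing works: even naive text similarity across report summaries flags candidate recurrences for a human (or an LLM) to examine. The sketch below uses Jaccard overlap on word sets; the report IDs and summaries are invented, and a production version would use embeddings or an LLM rather than token overlap.

```python
import re
from itertools import combinations

# Hypothetical past post-mortem summaries; real input would be the
# full report archive pulled from your wiki.
reports = {
    "INC-101": "primary postgres hit max connections during quarterly traffic spike",
    "INC-214": "checkout service oom after deploy skipped canary stage",
    "INC-309": "postgres connection pool exhausted under peak traffic, failover triggered",
}

STOPWORDS = {"the", "a", "after", "during", "under", "and", "of"}

def tokens(text):
    """Lowercased word set, minus common stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def similar_pairs(reports, threshold=0.1):
    """Flag report pairs whose Jaccard token overlap crosses the threshold."""
    pairs = []
    items = [(k, tokens(v)) for k, v in reports.items()]
    for (a, ta), (b, tb) in combinations(items, 2):
        score = len(ta & tb) / len(ta | tb)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

print(similar_pairs(reports))  # links the two postgres incidents, 15 months apart
```

Even this crude measure connects the two postgres incidents while leaving the unrelated OOM alone, which is exactly the correlation no individual reviewer has the bandwidth to do across hundreds of reports.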

4. Causal analysis and blame reduction

AI traces contributing factors as a chain of system interactions, not a sequence of human decisions. A bad deploy didn't cause the outage. A missing canary check allowed the bad deploy. A capacity gap meant the canary wouldn't have caught it anyway. A monitoring blind spot meant the capacity gap was unknown. When the causal chain is data-driven, the conversation shifts from "who did it" to "what allowed it."
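One way to make "what allowed it" concrete is to model each contributing factor as a link pointing at the deeper condition that permitted it. This is an illustrative data structure, not any tool's API; the factor descriptions restate the example from the paragraph above.

```python
from dataclasses import dataclass

@dataclass
class Factor:
    """One link in a causal chain: a system condition, not a person."""
    description: str
    allowed_by: "Factor | None" = None

# The chain from the example above, built system-first.
blind_spot = Factor("monitoring blind spot hid the capacity gap")
capacity = Factor("capacity gap meant the canary would not have caught it", allowed_by=blind_spot)
no_canary = Factor("missing canary check let the deploy reach production", allowed_by=capacity)
outage = Factor("bad deploy triggered the outage", allowed_by=no_canary)

def walk(factor):
    """Yield descriptions from proximate cause down to root condition."""
    while factor:
        yield factor.description
        factor = factor.allowed_by

for i, step in enumerate(walk(outage), 1):
    print(f"{i}. {step}")
```

Walking the chain always ends at a system condition, never a name, which is the structural property that keeps the review blameless.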

5. Predictive insights

Historical post-mortem data becomes a dataset for forecasting. If three of your last five incidents involved a specific service during peak traffic windows, that's not a coincidence. It's a signal. AI surfaces these patterns as risk indicators, turning reactive post-mortems into proactive investment decisions.
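The "three of five incidents" signal reduces to a frequency count over historical incident records. A minimal sketch, with invented service names and a made-up peak-window flag standing in for whatever your incident tooling records:

```python
from collections import Counter

# Hypothetical incident history: (service, occurred during a peak-traffic window?)
incidents = [
    ("payments-api", True),
    ("search", False),
    ("payments-api", True),
    ("payments-api", True),
    ("inventory", False),
]

def risk_signals(incidents, min_count=3):
    """Flag services whose peak-window incident count crosses a threshold."""
    peak_counts = Counter(svc for svc, peak in incidents if peak)
    return {svc: n for svc, n in peak_counts.items() if n >= min_count}

print(risk_signals(incidents))
```

The counting is trivial; the leverage is in having the incident history structured and queryable at all, which is what the earlier pipeline phases provide.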

From IT Outages to Supply Chain Failures

The same investigation framework applies beyond infrastructure. Any scenario where you need to reconstruct what happened, identify root causes, and prevent recurrence is a post-mortem scenario.

  • Deployment failure. Without AI: 6 hrs searching logs across 4 tools; report based on 2 engineers' memory. With AI: timeline built in minutes from CI/CD, logs, and Slack, linked to 3 prior similar incidents.
  • Security breach. Without AI: forensics team spends days assembling access logs, network traces, and audit trails. With AI: access patterns correlated, anomalous lateral movement flagged, attack timeline generated.
  • Supply chain disruption. Without AI: procurement reviews vendor emails, shipping records, and ERP data manually over weeks. With AI: ERP, logistics, and vendor data ingested to map cascade effects and single-source risks.
  • Natural disaster / crisis event. Without AI: after-action review assembles reports from multiple agencies over months. With AI: sensor data, response logs, and communication records synthesized into a unified timeline.


The common thread across all four: multi-source data synthesis and timeline reconstruction. Whether you're analyzing a Kubernetes pod crash or a warehouse flooding event, the investigative process follows the same shape. You gather evidence from disparate sources, stitch together a chronological narrative, then figure out what caused it and what to change so it doesn't happen again.

AI handles the first two steps faster and more completely than any human team. What changes across domains is the data sources, not the analysis pattern.

Building an AI Post-Mortem Workflow with Claude

Here's where theory meets implementation. A Claude-based post-mortem workflow connects to your existing tools through Model Context Protocol (MCP) servers, giving the AI direct access to the raw data it needs to investigate.

[Figure: Claude post-mortem architecture. Data sources (logs, alerts, Jira, Slack, Git, runbooks) feed through MCP servers into a Claude investigation agent equipped with a post-mortem skill, an investigation checklist, and a blameless template. Outputs: a minute-by-minute timeline, a causal map of contributing factors, a structured blameless report, and a cross-incident pattern database.]

The setup is practical. If your team already uses MCP servers to connect Claude to Jira, GitHub, and Confluence, you're halfway there. Add connections to your monitoring and logging tools, and Claude can pull incident data from every source in a single investigation session.

Custom Claude Code skills turn your post-mortem checklist into a repeatable investigation protocol. A post-mortem-investigation skill can encode your team's specific requirements: which systems to check, what questions to answer, how to structure the output, and which stakeholders need what level of detail.
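To make "encode your checklist" concrete, here's a sketch of the kind of structured protocol a skill might carry. The section names, questions, and the rendering helper are all hypothetical illustrations, not a Claude Code API; the point is that the checklist becomes data the agent follows rather than tribal knowledge.

```python
# A hypothetical investigation protocol a post-mortem skill might encode.
PROTOCOL = {
    "systems_to_check": ["deploy logs", "monitoring alerts", "on-call Slack channel"],
    "questions": [
        "When did the first anomalous signal appear?",
        "What changed in the hour before it?",
        "Which guardrail would have caught this earlier?",
    ],
    "output_sections": ["Timeline", "Contributing factors", "Action items"],
}

def render_protocol(protocol):
    """Render the checklist as the instruction text a skill would carry."""
    lines = ["Check these sources:"]
    lines += [f"- {s}" for s in protocol["systems_to_check"]]
    lines.append("Answer these questions:")
    lines += [f"- {q}" for q in protocol["questions"]]
    lines.append("Structure the report as: " + ", ".join(protocol["output_sections"]))
    return "\n".join(lines)

print(render_protocol(PROTOCOL))
```

Keeping the protocol as data means the same skill can be versioned, reviewed, and extended per incident type as the workflow matures.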

Tip

Start with a single incident type. Pick your most frequent failure mode (deployment rollback, database connection exhaustion, API timeout cascade) and build the investigation workflow for that one scenario. Expand to other incident types once the pattern works.

With the right MCP connections in place, Claude doesn't just summarize. It can cross-reference the deploy timestamp against the monitoring alert, pull the Slack thread where someone noticed latency, then check Git for what changed in that window. The investigation follows the evidence, not someone's recollection of the evidence.

What Stays Human

AI handles data synthesis. Humans own judgment.

AI handles:
  • Collecting data from every connected source
  • Assembling chronological timelines
  • Matching patterns across incident history
  • Drafting structured investigation reports
  • Flagging risk signals from historical data

Humans own:
  • Deciding what to do about the findings
  • Driving cultural and process change
  • Prioritizing investments and tradeoffs
  • Facilitating blameless review conversations
  • Validating AI analysis against context

No tool creates blameless culture. That takes leadership and trust built over dozens of retros where nobody got punished for honesty. But AI makes blameless post-mortems easier to practice by removing the conditions that breed blame in the first place.

When the timeline is reconstructed from system data rather than personal accounts, there's less room for "who did what" narratives. A causal chain showing five contributing factors instead of one bad deploy broadens the conversation on its own. And if the analysis links the current incident to three prior ones? Nobody's talking about individual error anymore. They're talking about where to invest.

Zalando's team reached the same conclusion: "Human curation remains crucial for accuracy, fostering trust, and addressing limitations like hallucinations and surface attribution errors." AI is the investigator. Humans are the judges.

Important

AI-generated post-mortem analyses should always be reviewed by the incident team before distribution. Treat them as a first draft assembled from evidence, not as a final verdict.

Turning Incident Data Into Strategic Advantage

The shift from manual post-mortems to AI-assisted investigations isn't about saving time on a single report. It's about building organizational memory that compounds.

Each report feeds the pattern detection that flags the next risk before it pages anyone. Reports stop collecting dust in Confluence and start working as infrastructure.

Zalando's team called their old reports "dead ends." After pointing AI at the archive, they found investment opportunities and systemic patterns that justified infrastructure changes. That transformation is available to any team willing to connect their incident data to an AI investigation workflow.

If your team runs post-mortems today but struggles to act on them, I can help you build the workflow. From MCP server configuration to custom investigation skills, the pieces exist. The question is simple: are your post-mortem reports doing anything, or just taking up space in a wiki nobody reads?


Want to talk about how this applies to your team?

Book a Discovery Call

Not ready for a call? Grab the Claude Adoption Checklist instead.
