Conducts blameless incident reviews and root cause analysis to document system failures and prevent future recurrence.
This skill lets Claude guide engineering teams through a structured postmortem after a system failure or production incident. Grounded in Site Reliability Engineering (SRE) principles, it helps build a factual timeline labeled at 5-minute intervals, traces systemic root causes with the 5 Whys method, and defines actionable, measurable improvements. Because the review focuses on process and architectural gaps rather than individual blame, each incident becomes a learning opportunity that strengthens system resilience and reduces the chance of recurrence.
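As a rough illustration of the 5-minute interval timeline, the sketch below groups timestamped events into 5-minute buckets. The function names (`floor_to_interval`, `build_timeline`) and the sample events are hypothetical and chosen only for this example; they are not part of the skill itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a fixed 5-minute bucket width, matching the description above.
INTERVAL = timedelta(minutes=5)


def floor_to_interval(ts: datetime, interval: timedelta = INTERVAL) -> datetime:
    """Round a timestamp down to the start of its interval."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    return midnight + (elapsed // interval) * interval


def build_timeline(events):
    """Group (timestamp, description) pairs into chronologically ordered buckets."""
    buckets = defaultdict(list)
    # Sort first so descriptions inside each bucket stay in time order.
    for ts, description in sorted(events):
        buckets[floor_to_interval(ts)].append(description)
    return sorted(buckets.items())


# Hypothetical incident events, for illustration only.
events = [
    (datetime(2024, 1, 1, 14, 3, 20), "Error-rate alert fires"),
    (datetime(2024, 1, 1, 14, 7, 5), "On-call engineer acknowledges page"),
    (datetime(2024, 1, 1, 14, 1, 0), "Deploy completes"),
]

for start, items in build_timeline(events):
    print(f"{start:%H:%M}  " + "; ".join(items))
```

Sorting before bucketing keeps the record factual and ordered even when events are collected out of sequence from different sources.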
Key Features
01 Identification of specific prevention strategies and testing gaps
02 Blameless root cause analysis (RCA) focusing on systems and processes
03 Precise event timeline construction with 5-minute interval labeling
04 Structured 5 Whys and Fishbone diagram methodologies
05 Generation of assignable, measurable, and actionable improvement items
Use Cases
01 Analyzing a production outage to identify architectural vulnerabilities
02 Reviewing high-severity bugs that bypassed existing CI/CD gates
03 Documenting "near-miss" incidents to improve observability and monitoring