Automates incident reporting, root cause analysis, and post-mortem documentation to drive lasting improvements in system reliability.
This skill covers the incident management lifecycle end to end, providing standardized frameworks for post-mortems, root cause analysis (RCA), and corrective action tracking. It applies established practices such as blameless reviews, the 'Five Whys' methodology, and severity-based escalation paths. By integrating with ITSM logs and team communication tools, it helps engineering and operations teams move beyond proximate causes to identify systemic gaps, so that each outage leads to verifiable, time-bound improvements aimed at preventing recurrence.
Key Features
01 Blame-free incident review templates and lessons-learned briefs
02 Guided Five Whys root cause analysis drill
03 Standardized Post-Mortem and Incident Log generation
04 Severity-based classification and escalation workflows (P1-P3)
05 Verifiable corrective action tracking with ownership and deadlines
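
The severity-based escalation and corrective action tracking listed above can be pictured with a small Python sketch. This is an illustration only, not the skill's actual implementation: the `Severity` enum, the `ESCALATION` table and its wording, the `CorrectiveAction` fields, and the helper functions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List


class Severity(Enum):
    """Severity levels P1-P3, with P1 the most critical."""
    P1 = 1
    P2 = 2
    P3 = 3


# Hypothetical escalation paths; a real team would define its own.
ESCALATION = {
    Severity.P1: "page on-call engineer and incident commander immediately",
    Severity.P2: "notify team lead within business hours",
    Severity.P3: "file a ticket for the next planning cycle",
}


def escalation_for(label: str) -> str:
    """Look up the escalation path for a label such as 'P1' or 'p2'."""
    return ESCALATION[Severity[label.upper()]]


@dataclass
class CorrectiveAction:
    """A follow-up item with a named owner and a deadline."""
    description: str
    owner: str
    due: date
    done: bool = False

    def is_overdue(self, today: date) -> bool:
        # An action is overdue only if it is still open past its due date.
        return not self.done and today > self.due


def overdue_actions(actions: List[CorrectiveAction], today: date) -> List[CorrectiveAction]:
    """Return the open actions whose deadline has passed."""
    return [a for a in actions if a.is_overdue(today)]
```

In this sketch, "verifiable" reduces to an explicit `done` flag, and "time-bound" to a `due` date that can be checked against the current day, for example with `overdue_actions(actions, date.today())`.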
Use Cases
01 Conducting a structured post-mortem after a critical P1 service outage
02 Maintaining a centralized tracker for long-term corrective actions across multiple incidents
03 Drilling down into systemic process failures using the Five Whys methodology
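
As a rough illustration of the Five Whys drill mentioned above, the following Python sketch records a chain of "why" answers and treats the last one as the candidate root cause. The `FiveWhys` class, its `max_depth` cap of five, and the sample incident are assumptions made for this example; they do not describe how the skill itself is built, and in practice the method may take more or fewer than five steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FiveWhys:
    """Records successive answers to 'why?' for a single problem."""
    problem: str
    max_depth: int = 5  # assumed cap; the method itself is not strict about five
    answers: List[str] = field(default_factory=list)

    def ask(self, answer: str) -> None:
        """Record the answer to the next 'why?' in the chain."""
        if len(self.answers) >= self.max_depth:
            raise ValueError("maximum depth reached")
        self.answers.append(answer)

    def root_cause(self) -> Optional[str]:
        """Return the deepest answer so far, or None if nothing was asked."""
        return self.answers[-1] if self.answers else None

    def render(self) -> str:
        """Format the chain as plain text for a post-mortem document."""
        lines = [f"Problem: {self.problem}"]
        for i, answer in enumerate(self.answers, start=1):
            lines.append(f"Why {i}? {answer}")
        return "\n".join(lines)


# Hypothetical incident used only to show the shape of the output.
drill = FiveWhys("Checkout API returned errors for 40 minutes")
drill.ask("The database ran out of connections")
drill.ask("A deploy doubled the worker count without raising the pool limit")
drill.ask("Pool limits are not part of the deploy checklist")
print(drill.render())
```

Each answer becomes the subject of the next question, which is how the drill moves from a proximate cause (exhausted connections) toward a process gap (a missing checklist item).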