Conducts systematic incident analysis that focuses on systemic causes rather than individual actions, to prevent recurrence and build a culture of reliability.
This skill provides a comprehensive framework for performing blameless postmortems following production outages, security breaches, or major bugs. Based on industry-leading SRE practices from Google and Etsy, it guides users through creating structured documentation, including chronological timelines, multi-factor root cause analysis, and prioritized action items. By emphasizing 'how' systems failed over 'who' made a mistake, this skill helps engineering teams foster psychological safety and transform technical failures into long-term organizational learning and improved system resilience.
Key Features
01 Chronological timeline construction with UTC synchronization
02 Action item prioritization and tracking frameworks
03 Blameless communication and questioning techniques
04 Structured templates for comprehensive incident documentation
05 Guidance for 'Five Whys' and systemic root cause analysis
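As a rough illustration of the timeline feature listed above, the sketch below converts events reported in different time zones to UTC and sorts them chronologically. It is not taken from the skill itself: `TimelineEvent`, `build_timeline`, and `format_timeline` are hypothetical names, and the sample events are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """Hypothetical record for one postmortem timeline entry."""
    timestamp: datetime  # must be timezone-aware
    description: str


def build_timeline(events):
    """Return the events sorted chronologically, with timestamps in UTC."""
    normalized = []
    for event in events:
        # Naive timestamps are ambiguous, so reject them instead of guessing.
        if event.timestamp.tzinfo is None:
            raise ValueError(f"naive timestamp for event: {event.description!r}")
        normalized.append(
            TimelineEvent(event.timestamp.astimezone(timezone.utc), event.description)
        )
    return sorted(normalized, key=lambda e: e.timestamp)


def format_timeline(events):
    """Render each event as a 'YYYY-MM-DD HH:MM UTC  description' line."""
    return [f"{e.timestamp:%Y-%m-%d %H:%M} UTC  {e.description}" for e in events]


if __name__ == "__main__":
    # Invented sample data: two reporters in different fixed UTC offsets.
    utc_minus_5 = timezone(timedelta(hours=-5))
    utc_plus_1 = timezone(timedelta(hours=1))
    raw = [
        TimelineEvent(datetime(2024, 1, 15, 10, 30, tzinfo=utc_plus_1), "Rollback started"),
        TimelineEvent(datetime(2024, 1, 15, 4, 5, tzinfo=utc_minus_5), "Alert fired"),
    ]
    for line in format_timeline(build_timeline(raw)):
        print(line)
```

Rejecting naive timestamps, rather than assuming a default zone, is a deliberate choice in this sketch: a silently misplaced event can distort the order of a timeline and the conclusions drawn from it.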
Use Cases
01 Analyzing production service outages or performance degradation
02 Investigating security incidents and data breaches
03 Facilitating team retrospectives after major software delivery failures