About
This skill provides a structured framework for managing system outages and performance degradation. It guides users through a four-phase workflow: triage, investigation, resolution, and postmortem analysis. Drawing on Google SRE practices and ITIL standards, it helps developers and on-call engineers identify root causes with 5-Whys analysis, roll back changes safely, and keep post-incident reports blameless. The standardized documentation and actionable findings it produces support high availability and better system reliability.
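To make the four-phase workflow concrete, here is a minimal Python sketch of how an incident might move through the phases while collecting a 5-Whys chain. It is not the skill's actual implementation: the `Phase` and `Incident` names, the fields, and the rule that 5-Whys answers are recorded only during investigation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Phase(IntEnum):
    """The four phases named in the description, in workflow order."""
    TRIAGE = 1
    INVESTIGATION = 2
    RESOLUTION = 3
    POSTMORTEM = 4


@dataclass
class Incident:
    """Hypothetical incident record; all field names are illustrative."""
    title: str
    phase: Phase = Phase.TRIAGE
    whys: List[str] = field(default_factory=list)

    def advance(self) -> Phase:
        """Move to the next phase in order; phases cannot be skipped."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase + 1)
        return self.phase

    def ask_why(self, answer: str) -> int:
        """Append one answer to the 5-Whys chain and return the chain length.

        Restricting this to the investigation phase is an assumption of
        this sketch, not a rule stated by the skill.
        """
        if self.phase is not Phase.INVESTIGATION:
            raise ValueError("5-Whys answers are recorded during investigation")
        self.whys.append(answer)
        return len(self.whys)

    def root_cause(self) -> Optional[str]:
        """Treat the most recent answer as the working root cause."""
        return self.whys[-1] if self.whys else None


if __name__ == "__main__":
    incident = Incident("checkout latency spike")  # made-up example incident
    incident.advance()                             # triage -> investigation
    incident.ask_why("requests queued behind the database")
    incident.ask_why("connection pool was exhausted")
    print(incident.phase.name, "|", incident.root_cause())
```

Modelling the phases as an ordered enum keeps the sequence explicit, so an incident cannot reach the postmortem without passing through resolution. Each `ask_why` call records one link in the causal chain, and the last link serves as the candidate root cause for the postmortem.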