Standardizes production incident management with specialized playbooks for on-call engineering and post-mortem documentation.
This skill gives Claude a framework for handling system outages and performance degradations. It helps developers and SREs work through high-pressure on-call scenarios with structured templates for incident detection, severity assessment, stakeholder communication, and technical mitigation such as rollbacks and service restarts. Its built-in practices for investigation and post-incident analysis help teams resolve issues quickly and record the root causes needed for long-term system reliability.
Key Features
01 Severity-based incident management framework (SEV-1 to SEV-4)
02 Standardized communication templates for stakeholder updates
03 97 GitHub stars
04 Comprehensive post-mortem and root cause analysis templates
05 Technical mitigation playbooks for Kubernetes and API services
06 On-call handoff procedures and escalation decision trees
Use Cases
01 Drafting immediate status updates during an active production outage
02 Creating a detailed post-mortem report following a service recovery
03 Standardizing on-call handoff documentation for engineering teams
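
To make the first use case concrete, here is a minimal Python sketch of how a severity-tagged status update might be drafted. It is not taken from the skill itself: the `Incident` fields, the `draft_status_update` helper, the wording of each severity label, and the sample incident are all hypothetical, since the listing names the SEV-1 to SEV-4 scale without defining its criteria or template layout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical labels: the listing names SEV-1 to SEV-4 but does not define them.
SEVERITY_LABELS = {
    1: "critical outage",
    2: "major degradation",
    3: "minor degradation",
    4: "low-impact issue",
}


@dataclass
class Incident:
    """Illustrative incident record; field names are assumptions."""
    title: str
    severity: int          # 1 (most severe) through 4
    impact: str            # who or what is affected
    mitigation: str        # action currently in progress
    started_at: datetime   # timezone-aware start time


def draft_status_update(incident: Incident, now: datetime) -> str:
    """Render a short plain-text stakeholder update for an active incident."""
    if incident.severity not in SEVERITY_LABELS:
        raise ValueError(f"severity must be 1-4, got {incident.severity}")
    # Whole minutes elapsed since the incident started.
    elapsed_min = int((now - incident.started_at).total_seconds() // 60)
    label = SEVERITY_LABELS[incident.severity]
    return "\n".join([
        f"[SEV-{incident.severity}] {incident.title} ({label})",
        f"Impact: {incident.impact}",
        f"Current action: {incident.mitigation}",
        f"Elapsed: {elapsed_min} min",
    ])


if __name__ == "__main__":
    # Made-up example data for illustration only.
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    incident = Incident(
        title="Checkout API errors",
        severity=1,
        impact="Customers cannot complete purchases",
        mitigation="Rolling back the latest deployment",
        started_at=start,
    )
    print(draft_status_update(incident, start + timedelta(minutes=30)))
```

Keeping the severity labels in a single table means the update format and any escalation logic would read from the same definitions, which is one plausible way to keep stakeholder messages consistent across incidents.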