Orchestrates multi-agent incident response workflows using modern SRE practices to accelerate service restoration and root cause analysis.
This skill provides a comprehensive Incident Command System (ICS) for Claude Code, guiding users through a structured five-phase response: Detection & Triage, Investigation & RCA, Resolution & Recovery, Communication & Coordination, and Postmortem & Prevention. By leveraging specialized sub-agents for observability, debugging, and security, it automates the heavy lifting of production incident management. It ensures that critical tasks—such as stakeholder updates, observability sweeps, and blameless postmortems—are performed consistently, helping teams maintain SLAs while turning every outage into a learning opportunity.
Key Features
01Automated generation of blameless postmortems and system hardening roadmaps
02Automated incident classification and severity-based communication protocols
0331,721 GitHub stars
04Multi-agent coordination for specialized roles like Security Auditor and Observability Engineer
05Structured five-phase workflow based on SRE best practices
06Deep-dive debugging and root cause analysis using Five Whys methodology
Use Cases
01Coordinating cross-functional communication between engineering and stakeholders during downtime
02Managing high-severity (P0/P1) production outages with a clear command structure
03Performing rapid root cause analysis and implementing emergency hotfixes in production