Orchestrates end-to-end incident response workflows using modern SRE practices and multi-agent coordination for rapid service restoration.
This skill implements a comprehensive Incident Command System (ICS) to manage production outages and system degradations. It orchestrates specialized AI agents through five distinct phases: detection/triage, deep investigation/RCA, resolution, stakeholder communication, and postmortem analysis. By applying blameless culture and structured SRE principles, it helps engineering teams minimize Mean Time to Resolution (MTTR) while ensuring every incident results in actionable system hardening and improved observability.
主要功能
0139 GitHub stars
02Multi-phase incident management from detection to postmortem
03Orchestrated agent roles including Observability Engineer and Security Auditor
04Automated severity classification (P0-P3) and SLO impact assessment
05Structured communication templates for stakeholders and status pages
06Automated blameless postmortem generation with actionable remediation items
使用场景
01Coordinating multi-agent workflows for complex system recovery
02Conducting deep-dive root cause analysis (RCA) using observability data
03Rapidly triaging and mitigating high-severity (P0/P1) production outages