Provides a disciplined, evidence-based framework for investigating and resolving production incidents and service outages.
The Runbook skill transforms AI agents into disciplined engineers by enforcing a structured five-phase incident response process: intake, triage, investigation, resolution, and documentation. It prevents the common pitfall of 'guessing' fixes by mandating evidence collection, timeline mapping, and systematic hypothesis testing. Whether dealing with service outages, error spikes, or performance degradation, this skill ensures that root causes are identified and verified before any production changes are applied, leading to more stable systems and better documentation.
Características Principales
01Structured 5-phase incident response workflow (Intake to Document)
02Evidence-based hypothesis testing to prevent premature fixes
03Automated triage logic for severity and blast radius assessment
042 GitHub stars
05Post-incident report generation for future knowledge sharing
06Strategic mitigation options including rollbacks and scaling
Casos de Uso
01Generating comprehensive post-mortem documentation after critical incidents
02Investigating mysterious error rate spikes or regression patterns
03Resolving production service outages and performance degradation