How does this skill help reduce Mean Time to Resolution (MTTR)?

It automates the triage and investigation phases by orchestrating specialized agents to perform rapid observability sweeps, log analysis, and initial mitigation steps simultaneously.
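As a rough illustration of what running these steps simultaneously could look like, the sketch below launches three placeholder coroutines concurrently with `asyncio.gather`. The function names, the `incident_id` parameter, and the returned dictionaries are hypothetical and are not taken from the skill itself.

```python
import asyncio


# Hypothetical triage steps. Real versions would query metrics and log
# backends; asyncio.sleep(0) stands in for that I/O-bound work.
async def observability_sweep(incident_id: str) -> dict:
    await asyncio.sleep(0)
    return {"step": "observability_sweep", "incident": incident_id}


async def log_analysis(incident_id: str) -> dict:
    await asyncio.sleep(0)
    return {"step": "log_analysis", "incident": incident_id}


async def initial_mitigation(incident_id: str) -> dict:
    await asyncio.sleep(0)
    return {"step": "initial_mitigation", "incident": incident_id}


async def run_triage(incident_id: str) -> list:
    # gather() schedules all three coroutines concurrently and returns
    # their results in the same order as the arguments.
    return await asyncio.gather(
        observability_sweep(incident_id),
        log_analysis(incident_id),
        initial_mitigation(incident_id),
    )


results = asyncio.run(run_triage("INC-001"))
```

The point of the sketch is only that total triage time is bounded by the slowest step rather than the sum of all steps.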

How does the multi-agent coordination work?

The skill assigns each sub-task to a specialized agent type, such as 'observability-engineer' for metrics or 'security-auditor' for breach analysis, so that domain-specific expertise is applied to each part of the incident.
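One way to picture this assignment is a small routing table. In the sketch below, only the agent type names 'observability-engineer' and 'security-auditor' come from the description above; the task categories, the `SubTask` class, and the `assign_agent` function are assumptions made for illustration.

```python
from dataclasses import dataclass

# Agent type names are from the skill description; the category keys
# ("metrics", "breach") are hypothetical.
ROUTING = {
    "metrics": "observability-engineer",
    "breach": "security-auditor",
}


@dataclass(frozen=True)
class SubTask:
    category: str
    description: str


def assign_agent(task: SubTask) -> str:
    """Return the agent type registered for the task's category."""
    try:
        return ROUTING[task.category]
    except KeyError:
        raise ValueError(
            f"no agent type registered for category {task.category!r}"
        ) from None


def plan(tasks: list) -> dict:
    """Map each sub-task description to the agent type that should handle it."""
    return {task.description: assign_agent(task) for task in tasks}


assignments = plan([
    SubTask("metrics", "Check latency and error-rate dashboards"),
    SubTask("breach", "Review authentication logs for anomalies"),
])
```

Failing loudly on an unknown category is a deliberate choice in this sketch: during an incident, an unassigned sub-task is easier to spot than one silently routed to a default agent.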

Does this skill support standard SRE severity levels?

Yes, it includes a built-in configuration for P0/SEV-1 (outages) through P3 (cosmetic issues), adjusting the response intensity and communication frequency accordingly.
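A severity configuration of this kind might be modeled as a lookup from level to response policy. The sketch below is not the skill's actual configuration: the `ResponsePolicy` fields and every interval value are illustrative assumptions. Only the P0 through P3 range comes from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = 0  # outage (SEV-1), per the description
    P1 = 1
    P2 = 2
    P3 = 3  # cosmetic issue, per the description


@dataclass(frozen=True)
class ResponsePolicy:
    page_on_call: bool            # hypothetical field
    update_interval_minutes: int  # hypothetical field


# All values below are assumptions chosen for illustration only.
POLICIES = {
    Severity.P0: ResponsePolicy(page_on_call=True, update_interval_minutes=15),
    Severity.P1: ResponsePolicy(page_on_call=True, update_interval_minutes=30),
    Severity.P2: ResponsePolicy(page_on_call=False, update_interval_minutes=120),
    Severity.P3: ResponsePolicy(page_on_call=False, update_interval_minutes=1440),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Look up the response policy for a severity level."""
    return POLICIES[severity]
```

Here, response intensity and communication frequency both fall as the severity number rises, which mirrors the behavior described in the answer.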

Can it help with post-incident documentation?

Absolutely. It includes a dedicated phase for generating blameless postmortems, documenting timelines, decision rationale, and creating systematic improvement roadmaps.
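To make the timeline and decision-rationale part concrete, here is a minimal sketch that sorts recorded events and renders them as plain-text lines. The `TimelineEvent` fields and the output format are hypothetical; the skill's real postmortem template is not shown in this description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime
    action: str     # what was done, described without naming individuals
    rationale: str  # why the decision was made at the time


def render_timeline(events: list) -> str:
    """Render events in chronological order, one line per event."""
    lines = []
    for event in sorted(events, key=lambda e: e.at):
        lines.append(
            f"{event.at:%H:%M} {event.action} (rationale: {event.rationale})"
        )
    return "\n".join(lines)
```

Recording the rationale next to each action keeps the document focused on the information available at the time, which is the core of a blameless review.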

SRE Incident Response Framework

Name: SRE Incident Response Framework
Author: lingxling

Analytics & Monitoring

Orchestrates end-to-end incident response workflows using modern SRE practices and multi-agent coordination for rapid service restoration.

This skill implements a comprehensive Incident Command System (ICS) to manage production outages and system degradations. It orchestrates specialized AI agents through five distinct phases: detection/triage, deep investigation/RCA, resolution, stakeholder communication, and postmortem analysis. By applying blameless culture and structured SRE principles, it helps engineering teams minimize Mean Time to Resolution (MTTR) while ensuring every incident results in actionable system hardening and improved observability.
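The five phases named above can be sketched as an ordered enumeration. Treating them as a strictly linear sequence is a simplifying assumption of this example: in practice, stakeholder communication would likely run alongside the other phases. The `Phase` and `next_phase` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECTION_TRIAGE = 1
    INVESTIGATION_RCA = 2
    RESOLUTION = 3
    STAKEHOLDER_COMMUNICATION = 4
    POSTMORTEM = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after `current`, or None once the postmortem is done."""
    phases = list(Phase)  # Enum iteration follows definition order
    index = phases.index(current)
    return phases[index + 1] if index + 1 < len(phases) else None


# Walk the whole workflow from the first phase to the last.
workflow = []
phase: Optional[Phase] = Phase.DETECTION_TRIAGE
while phase is not None:
    workflow.append(phase.name)
    phase = next_phase(phase)
```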

Key Features

01 39 GitHub stars

02 Multi-phase incident management from detection to postmortem

03 Orchestrated agent roles including Observability Engineer and Security Auditor

04 Automated severity classification (P0-P3) and SLO impact assessment

05 Structured communication templates for stakeholders and status pages

06 Automated blameless postmortem generation with actionable remediation items

Use Cases

01 Coordinating multi-agent workflows for complex system recovery

02 Conducting deep-dive root cause analysis (RCA) using observability data

03 Rapidly triaging and mitigating high-severity (P0/P1) production outages
