概要
This skill empowers SREs and DevOps teams to proactively manage system health by automating the definition of complex alerting rules within Claude Code. It guides users through selecting metric categories—such as latency, error rates, and resource utilization—while establishing intelligent thresholds based on historical baselines to minimize alert fatigue. Beyond simple rule creation, it streamlines incident response by configuring routing policies, escalation paths, and generating actionable runbooks to ensure teams can respond rapidly to performance anomalies and SLO violations.