About
This skill provides a framework for Site Reliability Engineering (SRE) practices. It helps teams measure service performance through service level indicators (SLIs) such as availability, latency, and durability, set internal reliability targets (SLOs), and calculate error budgets to balance innovation with stability. It includes Prometheus recording rules and alerting configurations, multi-window burn rate alerts, and standardized Grafana dashboard structures, giving developers clear data on service health for operational decision-making.
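To make the error budget and burn rate ideas concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only, not the skill's actual implementation: the `Slo` class, the `burn_rate` and `should_page` helpers, the 30-day window, and the 14.4 threshold (a commonly cited value for a fast-burn page on a 30-day SLO) are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a target ratio over a rolling window."""
    target: float          # e.g. 0.999 for "99.9% of requests succeed"
    window_days: int = 30  # assumed rolling window length

    @property
    def error_budget(self) -> float:
        # Fraction of requests (or time) allowed to fail within the window.
        return 1.0 - self.target

    def allowed_downtime_minutes(self) -> float:
        # Error budget expressed as minutes of full outage in the window.
        return self.error_budget * self.window_days * 24 * 60


def burn_rate(error_ratio: float, slo: Slo) -> float:
    # A burn rate of 1.0 spends exactly the whole budget over the window;
    # higher values exhaust it proportionally faster.
    return error_ratio / slo.error_budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: Slo, threshold: float = 14.4) -> bool:
    # Multi-window check: both the long window (e.g. 1h) and the short
    # window (e.g. 5m) must exceed the threshold, so the alert fires on
    # sustained burn and clears soon after the errors stop.
    return (burn_rate(long_window_ratio, slo) > threshold
            and burn_rate(short_window_ratio, slo) > threshold)


if __name__ == "__main__":
    slo = Slo(target=0.999)
    print(f"Allowed downtime: {slo.allowed_downtime_minutes():.1f} min")
    print("Page on sustained burn:", should_page(0.02, 0.02, slo))
    print("Page after recovery:", should_page(0.02, 0.001, slo))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of error budget. Requiring both windows to exceed the threshold is what keeps a brief spike, or an incident that has already recovered, from paging anyone.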