About
This skill provides a framework for Site Reliability Engineering (SRE) that helps teams define Service Level Indicators (SLIs), set Service Level Objectives (SLOs) as internal reliability targets, and manage error budgets. It includes implementation patterns for Prometheus recording rules, multi-window burn-rate alerting on the error budget, and Grafana dashboard structures. With it, developers can replace reactive firefighting with a data-driven approach to service performance, balancing reliability goals against the pace of feature delivery.
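
To make the error budget and multi-window burn-rate ideas concrete, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the function names are hypothetical, and the 14.4 threshold is an assumption borrowed from common SRE practice (it corresponds to spending 2% of a 30-day budget in one hour), not something stated above.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / error_budget(slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold, see the note above
) -> bool:
    """Multi-window check: fire only if BOTH windows exceed the threshold.

    The long window (for example 1h) shows the burn is significant; the
    short window (for example 5m) shows it is still happening now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% availability target
    # Sustained 2% errors in both windows: burn rate is roughly 20x.
    print(should_page(0.02, 0.02, slo))
    # The long window is still elevated but the short window has recovered.
    print(should_page(0.02, 0.001, slo))
```

Requiring both windows to exceed the threshold is what keeps an alert from lingering after an incident has already recovered. In a Prometheus setup the error ratios would typically come from recording rules rather than being passed in by hand.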