Implements service reliability targets using SLIs, SLOs, and error budgets to balance innovation velocity with system stability.
This skill packages Site Reliability Engineering (SRE) practices so that developers can define, measure, and manage service reliability through Claude. It offers standardized patterns for Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, together with Prometheus recording rules, multi-window burn-rate alerting rules, and Grafana dashboard templates. By tying performance metrics to business requirements, it helps engineering teams decide from data when to prioritize feature development and when to invest in reliability.
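To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal Python sketch. It is not taken from the skill itself: the names `SloReport` and `evaluate_slo` are hypothetical, and it assumes a simple ratio-based availability SLI (good events divided by total events).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good events
    error_budget: float      # allowed fraction of bad events (1 - SLO)
    budget_consumed: float   # share of the budget already spent (1.0 = all of it)
    budget_remaining: float  # share of the budget left (negative when overspent)


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a ratio-based SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    error_budget = 1.0 - slo_target
    # Observed bad-event fraction relative to the allowed bad-event fraction.
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, 1.0 - budget_consumed)


# Example: 500 failed requests out of 1,000,000 against a 99.9% SLO.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI={report.sli:.4%}, budget consumed={report.budget_consumed:.1%}")
```

In this example the service fails 0.05% of requests while the SLO allows 0.1%, so roughly half of the error budget has been spent.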
Key Features
01 Pre-configured Prometheus recording and alerting rules
02 Standardized SLI/SLO/SLA hierarchy definitions
03 Multi-window burn rate alert configurations to reduce noise
04 Grafana dashboard structures for reliability visualization
05 Automated error budget calculation and policy templates
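The multi-window burn rate idea in the list above can be sketched in Python as follows. This is an illustration of the general technique rather than the skill's actual alerting rules: the function names are hypothetical, and the default threshold of 14.4 is the commonly cited paging value for a 1-hour long window paired with a 5-minute short window over a 30-day SLO period, assumed here rather than confirmed by the source.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole error budget exactly over the SLO period.
    """
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value, see the note above
) -> bool:
    """Fire only when both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


# Sustained 2% errors against a 99.9% SLO: burn rate of about 20, so page.
sustained = should_page(0.02, 0.02, slo_target=0.999)
# Long window still elevated but the short window has recovered: no page.
recovered = should_page(0.02, 0.001, slo_target=0.999)
# Brief spike visible only in the short window: no page.
blip = should_page(0.001, 0.05, slo_target=0.999)
print(sustained, recovered, blip)
```

Requiring both windows is what reduces noise: a burn rate of 14.4 held for one hour spends 14.4 / 720 = 2% of a 30-day budget, while a short spike that never shows up in the long window is ignored.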
Use Cases
01 Creating high-fidelity monitoring and alerting for service availability and latency
02 Implementing SRE error budget policies to manage development velocity
03 Establishing reliability targets for production microservices
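An error budget policy of the kind mentioned in the use cases above might be expressed as a small decision function. The sketch below is hypothetical: the action names and the 25% caution threshold are illustrative assumptions, not values taken from the skill's policy templates.

```python
def budget_policy(budget_remaining: float, caution_threshold: float = 0.25) -> str:
    """Map the remaining error budget (as a fraction) to a release posture.

    budget_remaining: 1.0 means the budget is untouched, 0.0 or below means
    it is exhausted. The thresholds and action names are illustrative only.
    """
    if budget_remaining <= 0.0:
        # Budget exhausted: pause feature releases until reliability recovers.
        return "freeze-releases"
    if budget_remaining < caution_threshold:
        # Budget running low: favor reliability work over new features.
        return "prioritize-reliability"
    # Healthy budget: feature development proceeds at normal velocity.
    return "ship-features"


for remaining in (0.8, 0.1, -0.05):
    print(f"{remaining:+.0%} budget left -> {budget_policy(remaining)}")
```

Encoding the policy as code keeps the decision explicit and reviewable; a real policy would normally be agreed between product and engineering owners before any thresholds are fixed.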