Overview
This skill provides a framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets in a Site Reliability Engineering (SRE) context. It walks through setting reliability targets, writing Prometheus recording and alerting rules, and designing Grafana dashboards that visualize service health. By weighing the cost of downtime against development velocity, it helps engineering teams make data-driven decisions about when to ship features and when to invest in reliability work, so that user satisfaction stays high without sacrificing agility.
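To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal Python sketch of the arithmetic for a request-based availability SLO. The class name `ErrorBudget`, its fields, and the sample numbers are hypothetical and chosen only for illustration; the skill itself does not prescribe this code, and a real setup would more likely compute these values in Prometheus recording rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: error-budget arithmetic for one SLO window."""

    slo_target: float     # assumed target, e.g. 0.999 for 99.9% availability
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio (the SLI) over the window."""
        if self.total_requests == 0:
            return 1.0  # no traffic: treat the SLI as fully met
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / allowed


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {budget.sli:.5f}")
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.budget_remaining:.0%}")
```

With these sample numbers the SLO tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget unspent. A team might treat a healthy remaining budget as room to ship features and a depleted or negative one as a signal to prioritize reliability work.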