How do error budgets affect development velocity?

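An error budget is the amount of unreliability an SLO permits: for a 99.9% target, 0.1% of requests (or time) in the measurement window. It gives teams a quantitative basis for release decisions. While budget remains, they can keep shipping features; once it is exhausted, the team prioritizes reliability improvements over new feature development until the service returns to its SLO target.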
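The arithmetic behind that decision can be sketched in a few lines. This is an illustrative example only, not code from the skill: the `ErrorBudget` class, its field names, and the request-based accounting are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched; 0.0 or below means exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def feature_work_allowed(self) -> bool:
        # Assumed policy: ship features only while budget remains.
        return self.remaining_fraction > 0.0


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(round(healthy.remaining_fraction, 3), healthy.feature_work_allowed())
print(round(exhausted.remaining_fraction, 3), exhausted.feature_work_allowed())
```

With a 99.9% target and one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget and 1,200 failures overspend it.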

What is the difference between an SLI and an SLO?

An SLI (Service Level Indicator) is a quantitative measure of some aspect of the level of service provided, such as latency. An SLO (Service Level Objective) is a target value or range of values for a service level that is measured by an SLI.
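To make the distinction concrete, the sketch below computes a latency SLI from raw measurements and then checks it against an SLO. The function names, the 300 ms threshold, and the 95% target are hypothetical values chosen for illustration, not part of the skill.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the target the measured SLI is expected to reach."""
    return sli >= slo_target


# Four sample requests, one of them slower than the assumed 300 ms threshold.
sli = latency_sli([120.0, 80.0, 450.0, 95.0], threshold_ms=300.0)
print(sli, meets_slo(sli, slo_target=0.95))
```

The SLI is the measurement (here, 3 of 4 requests were fast enough); the SLO is the bar that measurement is held to.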

How does this skill handle alert fatigue?

The skill implements multi-window burn rate alerts, which combine a short-term and a long-term window and fire only when the error budget is being consumed too fast over both. Alerts are therefore triggered only for significant reliability threats, and transient spikes are filtered out.
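The two-window condition can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the 14.4 threshold is a value commonly cited for a 1-hour/5-minute window pair against a 30-day SLO, not one confirmed by the skill itself.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold, see lead-in
) -> bool:
    # Long window: the problem is significant. Short window: it is still happening.
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Sustained outage: both windows burn fast, so the alert fires.
print(should_page(0.02, 0.03, slo_target=0.999))
# Transient spike: only the short window is hot, so the alert stays quiet.
print(should_page(0.002, 0.05, slo_target=0.999))
```

Requiring both windows means a brief spike cannot page on its own, and an incident that has already recovered stops paging as soon as the short window clears.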

Does this skill work with Prometheus and Grafana?

Yes, it includes specific PromQL recording rules for SLI calculations and alerting rules for burn rates, as well as a structured blueprint for Grafana dashboards.

Site Reliability SLO Framework

Name: Site Reliability SLO Framework
Author: HermeticOrmus

Analytics and Monitoring

Defines and implements Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to establish data-driven reliability targets.

This skill provides a comprehensive framework for implementing Site Reliability Engineering (SRE) practices through the definition of SLIs, SLOs, and error budgets. It guides users through the technical setup of Prometheus recording rules, multi-window burn rate alerts, and Grafana dashboard structures, allowing teams to balance innovation velocity with service stability. By moving away from arbitrary uptime goals toward user-perceived reliability metrics, this skill helps developers implement proactive monitoring and automated error budget policies.

Key Features

01 Automated error budget calculation and management policies

02 Standardized SLI/SLO/SLA hierarchy for clear internal and external communication

03 Multi-window burn rate alerting logic to reduce false positives and alert fatigue

04 Prometheus recording rules for availability, latency, and durability SLIs

05 Templated Grafana dashboard structures for real-time reliability visualization

Use Cases

01 Governing feature release velocity based on remaining error budget

02 Implementing SRE-based alerting to prioritize critical service degradations

03 Establishing measurable reliability targets for production microservices

Install with 🐟 Skill.Fish

npx skillfish add hermeticormus/hermetic-academy slo-implementation

For use in Claude.ai and ChatGPT
