Can the skill help during an active system incident?

Yes, it provides structured incident management workflows, helps identify recovery steps via runbooks, and facilitates the creation of blameless postmortems.

What is chaos engineering in this context?

The skill guides you in designing controlled experiments that inject failures into your system to uncover weaknesses before they cause real-world outages.
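As a rough illustration of such a controlled experiment, the sketch below wraps a dependency call so that it fails with a configurable probability, then checks a steady-state hypothesis (a minimum success ratio). The names `InjectedFault`, `with_fault_injection`, and `run_experiment` are hypothetical and are not part of the skill; this is a minimal in-process sketch, not how any particular chaos tool works.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical name)."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(call, attempts, min_success_ratio):
    """Call the system under test repeatedly and check the steady-state hypothesis."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for _ in range(attempts):
        try:
            call()
            successes += 1
        except InjectedFault:
            pass  # count as a failed attempt; the experiment keeps running
    ratio = successes / attempts
    return {"success_ratio": ratio, "hypothesis_held": ratio >= min_success_ratio}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    flaky_lookup = with_fault_injection(lambda: "ok", failure_rate=0.2, rng=rng)
    print(run_experiment(flaky_lookup, attempts=1000, min_success_ratio=0.75))
```

In a real experiment the fault would normally be injected at the network or infrastructure layer and limited to a small blast radius; the in-process wrapper only shows the shape of the hypothesis check.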

What is the SRE Engineer skill for Claude Code?

It is a specialized capability that enables Claude to provide expert-level guidance on maintaining system reliability, defining performance metrics, and automating operations.

Does this skill help with automation?

Absolutely. It focuses on 'toil reduction' by identifying repetitive manual tasks and generating scripts or Terraform code to automate them.

How does this skill handle SLOs and SLIs?

The skill helps you identify meaningful Service Level Indicators (SLIs) and set realistic Service Level Objectives (SLOs) that balance user experience with development speed.
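To make the SLI/SLO relationship concrete, here is a minimal sketch of an availability calculation over one measurement window. The function name `error_budget` and its parameters are illustrative assumptions, not something the skill defines; it assumes a request-based availability SLI (good requests divided by total requests).

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise an availability SLO over one measurement window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that were served successfully.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of the budget still unspent; negative means it is exhausted.
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }


if __name__ == "__main__":
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

With a 99.9% target and one million requests, about 1,000 failures are tolerated, so 250 failures leave roughly three quarters of the budget unspent. A negative `budget_remaining` is the usual signal to slow feature releases in favour of reliability work.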

SRE & Reliability Engineer

Name: SRE & Reliability Engineer
Author: Jeffallan

Analytics & Monitoring

Implements high-availability system practices through SLO management, toil reduction, and automated monitoring strategies.

The SRE Engineer skill empowers Claude to act as a senior Site Reliability Engineer, focusing on the critical balance between feature velocity and system stability. It provides specialized logic for defining quantitative SLIs and SLOs, managing error budgets, and implementing 'golden signal' monitoring (latency, traffic, errors, and saturation). By leveraging this skill, developers can automate repetitive operational toil, design chaos engineering experiments to test system resilience, and establish professional incident management workflows including blameless postmortems and actionable runbooks.
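The four golden signals named above can be computed from very simple inputs. The sketch below is one possible way to do it, assuming a list of request records for a fixed window plus a current and maximum concurrency figure; the `Request` type, the `golden_signals` function, and the choice of 95th-percentile latency are all illustrative assumptions rather than anything the skill prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Compute latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("requests must not be empty")
    if window_seconds <= 0 or capacity <= 0:
        raise ValueError("window_seconds and capacity must be positive")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * len(latencies) + 99) // 100
    p95 = latencies[rank - 1]

    # Treat 5xx responses as errors; other definitions are equally valid.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }


if __name__ == "__main__":
    sample = [Request(latency_ms=10.0 * i, status=200) for i in range(1, 20)]
    sample.append(Request(latency_ms=200.0, status=503))
    print(golden_signals(sample, window_seconds=10, in_flight=30, capacity=40))
```

In practice these values usually come from a metrics system rather than raw request lists, and saturation is often measured on whichever resource is most constrained (CPU, memory, connection pool).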

Key Features

01 Chaos engineering experiment design and resilience testing

02 Quantitative SLI/SLO definition and error budget calculation

03 7 GitHub stars

04 Toil reduction through targeted automation and scripting

05 Golden signal monitoring and alerting configuration for observability

06 Blameless postmortem generation and incident response planning

Use Cases

01 Conducting incident root-cause analysis and creating remediation runbooks

02 Establishing a reliability framework and monitoring for new microservices

03 Automating manual infrastructure tasks to reduce operational overhead

Install with 🐟 Skill.Fish

npx skillfish add jeffallan/claude-skills sre-engineer

For use in Claude.ai and ChatGPT
