Architects resilient software systems that minimize downtime through fault tolerance, automated recovery patterns, and SRE best practices.
The Reliability Design skill enables Claude to guide developers and architects through the design of highly available systems, using industry-standard principles from Google SRE and Michael Nygard's Release It!. It provides a structured framework for defining measurable service levels (SLAs, SLOs, and SLIs), performing failure mode analysis, and implementing defensive architectural patterns such as circuit breakers, bulkheads, and graceful degradation. The skill is aimed at teams that want to reduce Mean Time To Recovery (MTTR), increase Mean Time Between Failures (MTBF), and keep distributed systems stable under stress or component failure.
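As an illustration of the circuit breaker pattern mentioned above, here is a minimal sketch in Python. It is not taken from the skill itself: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"
        return "open"

    def call(self, func, *args, **kwargs):
        # Fail fast while open instead of hitting the unhealthy dependency.
        if self.state == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half_open, or too many consecutive
            # failures while closed, opens (or re-opens) the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for a real client function, and treat `CircuitOpenError` as a signal to serve a fallback. A production breaker would likely also need thread safety and per-dependency metrics, which this sketch leaves out.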
Key Features
01 Failure mode and effects analysis (FMEA) for components
02 Automated recovery and failover mechanism planning
03 Observability instrumentation for critical path monitoring
04 Fault isolation using bulkheads and circuit breakers
05 Service Level (SLA/SLO/SLI) definition and measurement
06 9 GitHub stars
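To make the service-level feature concrete, the sketch below computes an availability SLI from request counts and the share of the error budget still left under an SLO. The function names and the example numbers are assumptions for illustration, not part of the skill.

```python
def sli_availability(good_events, total_events):
    """SLI: fraction of events that were served successfully."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left for the window (negative if overspent).

    slo_target is a ratio such as 0.999 for a 99.9% availability SLO.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures actually observed
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 500 failures, 99.9% SLO.
sli = sli_availability(999_500, 1_000_000)
remaining = error_budget_remaining(0.999, 999_500, 1_000_000)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.1%}")
```

With these numbers the SLO tolerates about 1,000 failed requests, so 500 failures leave roughly half of the budget. Teams commonly use the remaining budget to decide whether to ship risky changes or to prioritize reliability work.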
Use Cases
01 Improving system uptime and reducing Mean Time To Recovery (MTTR)
02 Designing disaster recovery strategies for mission-critical cloud services
03 Implementing graceful degradation patterns for microservices architectures
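Graceful degradation, the last use case above, can be sketched as a wrapper that serves a reduced response when the primary dependency fails. The helper `with_fallback` and the recommendation functions are hypothetical names invented for this example.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that failures are answered by `fallback`.

    The wrapped function returns (result, degraded), where `degraded`
    is True when the fallback path produced the result.
    """
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs), False
        except Exception:
            # Serve a reduced but still useful response instead of an error.
            return fallback(*args, **kwargs), True
    return wrapped


def personalized_recommendations(user_id):
    # Stand-in for a call to a recommendation service that is currently down.
    raise ConnectionError("recommendation service unavailable")


def popular_items(user_id):
    # Static default that needs no downstream call.
    return ["item-1", "item-2", "item-3"]


recommend = with_fallback(personalized_recommendations, popular_items)
items, degraded = recommend("user-42")
print(items, degraded)
```

Returning the `degraded` flag alongside the result lets the caller record how often the fallback path is used, which feeds back into the observability and SLI measurements listed under the features.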