Architects and implements production-grade observability systems including monitoring, logging, tracing, and reliability workflows.
This skill transforms Claude into a specialized observability engineer capable of designing and maintaining enterprise-scale reliability infrastructure. It provides deep expertise in the three pillars of observability—metrics, logs, and traces—helping teams define meaningful SLIs/SLOs, reduce alert noise, and optimize monitoring costs. Whether you are migrating to OpenTelemetry, configuring complex ELK stacks, or establishing incident response playbooks, this skill provides the domain-specific guidance needed to ensure system stability and performance in distributed environments.
Key Features
01. 39 GitHub stars
02. End-to-end monitoring infrastructure design using Prometheus, Grafana, and Datadog
03. Centralized log management strategy using ELK Stack, Loki, and Splunk
04. Automated incident response workflows and blameless postmortem templates
05. Distributed tracing implementation with OpenTelemetry and Jaeger for microservices
06. SLI/SLO framework development with error budget and burn rate analysis
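To make the error budget and burn rate item concrete, here is a minimal sketch of the usual arithmetic, not the skill's own implementation. It assumes a request-based availability SLI, and the function names (`error_budget`, `burn_rate`, `days_until_exhausted`) are hypothetical, chosen only for illustration. A burn rate of 1 means the budget is spent exactly at the end of the SLO window; higher values spend it proportionally faster.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    Returns 0.0 when there was no traffic in the measurement window.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the given burn rate persists."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # Hypothetical numbers: 50 failures out of 10,000 requests, 99.9% SLO.
    rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.2f}")
    print(f"days until budget exhausted: {days_until_exhausted(rate):.2f}")
```

In practice the `failed` and `total` counts would come from a metrics backend such as Prometheus, measured over a short window (for example one hour) and compared against a paging threshold.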
Use Cases
01. Troubleshooting production performance regressions using distributed tracing and APM
02. Optimizing observability costs through intelligent sampling and retention policies
03. Designing a scalable monitoring strategy for high-traffic microservices architectures
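As a rough illustration of the sampling use case, the sketch below shows deterministic, hash-based probabilistic sampling of traces. This is a deliberately simple stand-in: the listing does not say which sampling strategy the skill recommends, and the function name `keep_trace` is hypothetical. Hashing the trace ID means every service that sees the same ID reaches the same keep-or-drop decision, so sampled traces stay complete.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Return True if the trace should be kept.

    sample_rate is the fraction of traces to keep, from 0.0 to 1.0.
    The decision depends only on trace_id, so it is stable across
    services and restarts.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the top of the range.
    return bucket < int(sample_rate * 2**64)


if __name__ == "__main__":
    # Hypothetical trace IDs; roughly 10% should be kept.
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_trace(t, 0.10) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Lowering `sample_rate` reduces stored trace volume roughly in proportion, at the cost of missing some rare failures; keeping all error traces regardless of the rate is a common refinement that this sketch leaves out.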