Builds production-ready monitoring, logging, and tracing systems for enterprise-scale application reliability.
The Observability Engineer skill lets Claude act as a senior reliability specialist focused on designing and implementing comprehensive monitoring strategies. It covers distributed tracing, log management, and time-series metrics using widely adopted tools such as OpenTelemetry, Prometheus, and the ELK stack. Use this skill to define meaningful SLIs and SLOs, set actionable alerting thresholds, manage incident response workflows, and implement observability-as-code to keep systems available and performant.
Key Features
01 Incident response automation and runbook development
02 Distributed tracing and APM implementation using OpenTelemetry standards
03 Multi-cloud and Kubernetes infrastructure monitoring and alerting
04 31,721 GitHub stars
05 Comprehensive SLI/SLO management and error budget tracking
06 Advanced log aggregation and analysis with ELK, Loki, and Splunk
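As an illustration of the SLI/SLO and error budget tracking mentioned above, here is a minimal Python sketch. It is not taken from the skill itself: the `ErrorBudget` class, its field names, and the request-based availability SLI are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: request-based availability SLO over one window."""

    slo_target: float      # assumed format: 0.999 means a 99.9% success SLO
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI in that window

    @property
    def sli(self) -> float:
        # Measured success ratio; an empty window is treated as fully healthy.
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        # Number of failures the SLO tolerates for this traffic volume.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Share of the error budget still unspent; negative means overspent.
        # A 100% target (or an empty window) leaves no budget, so report 0.0.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {budget.sli:.5f}")
print(f"Budget remaining: {budget.remaining_fraction:.1%}")
```

With these example numbers the SLO tolerates roughly 1,000 failed requests, so 250 failures consume about a quarter of the budget. A real deployment would typically derive the counts from a metrics backend rather than hard-coded values.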
Use Cases
01 Architecting a monitoring strategy for high-traffic microservices
02 Designing cost-optimized telemetry pipelines for enterprise logs and metrics
03 Debugging complex performance regressions across distributed systems