How does this help with alert fatigue?

The skill focuses on defining actionable, symptom-based alerts linked to SLOs rather than noisy cause-based alerts, helping teams reduce false positives.
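To make "symptom-based alerts linked to SLOs" concrete, here is a minimal sketch of a multi-window burn-rate check. The function names, the 99.9% target, and the 14.4 threshold are assumptions chosen for illustration and are not taken from the skill itself. The check pages only when both a long and a short window are consuming error budget quickly, which filters out brief blips.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target  # allowed fraction of failing requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors: float,
                short_window_errors: float,
                slo_target: float = 0.999,   # assumed SLO, for illustration
                threshold: float = 14.4) -> bool:  # assumed burn-rate threshold
    """Page only if both windows burn budget faster than the threshold."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)
```

With these assumed defaults, a 2% error ratio in both windows would page, while a 2% long-window ratio that has already recovered to 0.05% in the short window would not.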

Can this skill help with OpenTelemetry migrations?

Yes, it provides expert guidance on deploying collectors, implementing auto-instrumentation, and transitioning from proprietary vendors to vendor-agnostic open standards.

Can it assist with observability cost management?

Yes, it offers specific strategies for managing high-cardinality metrics, optimizing log storage, and implementing trace sampling to control infrastructure spend.
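As one example of trace sampling, the sketch below shows a deterministic ratio-based decision keyed on the trace ID. The function name `keep_trace` and the use of SHA-256 are assumptions made for illustration; this is not the OpenTelemetry SDK's sampler, only the general idea behind ratio sampling.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, given a sample rate between 0.0 and 1.0.

    The decision depends only on the trace ID, so every service that sees
    the same ID reaches the same decision without coordination.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    cutoff = int(sample_rate * 2**64)           # keep values below the cutoff
    return value < cutoff
```

Keeping roughly 10% of traces would be `keep_trace(trace_id, 0.1)`. Because the decision is a pure function of the ID, a trace is kept or dropped as a whole rather than span by span.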

Does it support cloud-native environments?

Absolutely. It includes specialized patterns for Kubernetes monitoring, service mesh telemetry, and serverless observability across AWS, GCP, and Azure.

Observability & SRE Engineer

Name: Observability & SRE Engineer
Author: lingxling

Analytics and Monitoring

Architects and implements production-grade observability systems including monitoring, logging, tracing, and reliability workflows.

This skill turns Claude into a specialized observability engineer that can design and maintain enterprise-scale reliability infrastructure. It covers the three pillars of observability (metrics, logs, and traces) and helps teams define meaningful SLIs/SLOs, reduce alert noise, and optimize monitoring costs. Whether you are migrating to OpenTelemetry, configuring complex ELK stacks, or establishing incident response playbooks, it offers the domain-specific guidance needed to keep distributed systems stable and performant.

Key Features

01 39 GitHub stars

02 End-to-end monitoring infrastructure design using Prometheus, Grafana, and Datadog

03 Centralized log management strategy using ELK Stack, Loki, and Splunk

04 Automated incident response workflows and blameless postmortem templates

05 Distributed tracing implementation with OpenTelemetry and Jaeger for microservices

06 SLI/SLO framework development with error budget and burn rate analysis
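For readers new to error budgets, a minimal sketch of the underlying arithmetic follows. The function name and the sample numbers are hypothetical; the only fact relied on is that an SLO target of 99.9% leaves 0.1% of requests as the allowed failure budget.

```python
def error_budget_remaining(total_requests: int,
                           failed_requests: int,
                           slo_target: float) -> float:
    """Return the fraction of error budget left in the measurement window.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been violated.
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("need some traffic and an slo_target below 1.0")
    return 1.0 - failed_requests / allowed_failures
```

For instance, 1,000,000 requests at a 99.9% target allow about 1,000 failures, so 250 failures leave roughly three quarters of the budget.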

Use Cases

01 Troubleshooting production performance regressions using distributed tracing and APM

02 Optimizing observability costs through intelligent sampling and retention policies

03 Designing a scalable monitoring strategy for high-traffic microservices architectures
