Implements comprehensive observability architectures including structured logging, distributed tracing, and symptom-based alerting systems.
This skill helps engineers design and deploy production-grade observability stacks built on modern standards such as OpenTelemetry. It covers the three pillars of observability (logs, metrics, and traces), with a specific focus on cross-pillar correlation. It helps teams move from reactive debugging to proactive SLO-based management and error budget tracking, and it supports major platforms including ELK, Prometheus, Grafana, and Datadog.
Key Features
01 OpenTelemetry (OTel) integration and collector configuration
02 RED and USE metrics methodology implementation
03 Structured logging with mandatory trace and span correlation
04 SLI/SLO/SLA definition and error budget tracking
05 Dashboard design for Grafana, Datadog, and ELK/OpenSearch
06 11 GitHub stars
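As an illustration of structured logging with trace and span correlation, here is a minimal sketch using only the Python standard library. It is not the skill's own implementation. The names `trace_id_var`, `span_id_var`, `JsonFormatter`, `build_logger`, and `log_with_trace` are hypothetical, and the IDs are set by hand here; in an OpenTelemetry setup they would normally be read from the active span context.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variables holding the current trace and span IDs.
# Assumption: a tracing library would populate these in a real service.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying correlation IDs."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
            "span_id": span_id_var.get(),
        }
        return json.dumps(payload)


def build_logger(stream, name="checkout"):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_with_trace(logger, trace_id, span_id, message):
    """Log a message with the given IDs attached, then restore the context."""
    trace_token = trace_id_var.set(trace_id)
    span_token = span_id_var.set(span_id)
    try:
        logger.info(message)
    finally:
        span_id_var.reset(span_token)
        trace_id_var.reset(trace_token)


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log_with_trace(log, "4bf92f3577b34da6a3ce929d0e0e4736",
                   "00f067aa0ba902b7", "payment authorized")
    print(buffer.getvalue().strip())
```

Because every log line carries the same `trace_id` as the corresponding trace, a log search in ELK or Datadog can pivot directly to the trace view, which is the cross-pillar correlation the feature list refers to.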
Use Cases
01 Implementing distributed tracing across microservices to identify latency bottlenecks
02 Setting up symptom-based alerting rules and SLI dashboards for production reliability
03 Designing a vendor-neutral observability backend using OpenTelemetry
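To make symptom-based alerting and error budget tracking concrete, the sketch below computes an error budget burn rate and applies a two-window paging check. It is an illustrative assumption, not the skill's own rule set. The function names are hypothetical, and the default threshold of 14.4 follows the commonly cited multiwindow burn-rate approach (roughly 2% of a 30-day budget consumed in one hour); tune it to your own SLO policy.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed, total, slo_target):
    """Observed error ratio divided by the error budget.

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values exhaust it proportionally faster.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def should_page(short_window, long_window, slo_target, threshold=14.4):
    """Page only when both windows burn faster than the threshold.

    Each window is a (failed, total) tuple of request counts. Requiring
    both windows reduces pages for brief spikes that have already ended.
    """
    short = burn_rate(short_window[0], short_window[1], slo_target)
    long = burn_rate(long_window[0], long_window[1], slo_target)
    return short >= threshold and long >= threshold


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    # 2% errors in both windows: burn rate near 20, above the threshold.
    print(should_page((20, 1000), (200, 10000), slo))
    # Long window at 0.5% errors: burn rate near 5, below the threshold.
    print(should_page((20, 1000), (50, 10000), slo))
```

The alert fires on a user-visible symptom (the rate of failed requests relative to the SLO) rather than on a cause such as CPU load, which is what distinguishes symptom-based alerting from resource-threshold alerting.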