Discover Agent Skills for analytics & monitoring. Browse 47 skills for Claude, ChatGPT & Codex.
Automates the detection, classification, and resolution of system errors using a 13-category taxonomy and systematic recovery patterns.
Manages production incidents using SRE methodologies for rapid investigation, mitigation, and postmortem documentation.
Implements a systematic error-handling methodology using a 13-category taxonomy to diagnose, recover from, and prevent session failures.
Implements comprehensive logging, metrics, and distributed tracing to ensure production reliability and performance monitoring.
Implements production-grade observability for Cloudflare Workers using structured logging, real-time log streaming, and custom performance metrics.
Analyzes and optimizes application performance across algorithms, databases, and frontend frameworks to resolve bottlenecks and reduce latency.
Implements production-grade monitoring, logging, and tracing systems to ensure application reliability and performance.
Diagnoses complex software errors using automated stack trace analysis and systematic root cause investigation.
Simplifies Google Tag Manager (GTM) implementations by providing expert guidance on tags, triggers, variables, and data layer configurations.
Validates Prometheus metrics implementation in Go applications to ensure optimal observability and performance.
Automates error capture, intelligent batching, and structured logging to streamline AI agent recovery and orchestration.
Diagnoses and resolves failures in dbt Cloud and platform jobs using systematic workflows and root cause analysis.
Profiles and optimizes OCaml memory allocations to eliminate boxing overhead and improve application performance.
Answers complex business questions by intelligently navigating dbt semantic layers, models, and project artifacts.
Implements comprehensive request tracking across microservices using Jaeger and Tempo to identify performance bottlenecks and service dependencies.
Configures Prometheus for comprehensive metric collection, alerting, and observability across infrastructure and applications.
Implements service reliability targets using SLIs, SLOs, and error budgets to balance innovation velocity with system stability.
Implements comprehensive Kafka monitoring using Prometheus and Grafana to track cluster health, consumer lag, and broker performance.
Identifies, quantifies, and prioritizes technical debt within codebases using ROI-based remediation plans.
Guides the configuration and troubleshooting of dbt MCP server connections for AI-powered data engineering and analytics workflows.
Automates the creation and management of production-grade Grafana dashboards for real-time system and application observability.
Visualizes Rust project dependencies and coupling metrics through an interactive web interface to identify architectural hotspots.
Enables comprehensive audit trails and structured logging for Docker-based Claude Flow Novice (CFN) agent execution.
Analyzes and optimizes JavaScript bundle sizes to improve web application performance and load times.
Monitors Hawk job status, retrieves logs, and diagnoses issues for AI evaluation runs within the UK AISI Inspect framework.
Architects robust, hook-based event systems to monitor and broadcast real-time AI agent activities within Claude Code workflows.
Guides the definition of reliability targets, selection of service indicators, and implementation of error budget policies.
Implements and debugs request flows across microservices using OpenTelemetry standards and distributed tracing patterns.
Standardizes incident management processes, from initial detection and triage to postmortem analysis and reliability improvements.
Optimizes end-to-end performance and reduces response times across distributed systems by tuning network protocols, application logic, and database layers.