Statistical drift detection (Population Stability Index and Kolmogorov-Smirnov tests) to identify quality regressions in production models
Infrastructure monitoring with Prometheus metrics and Grafana dashboard templates
Standardized alerting rules and severity levels for effective incident management
Silent failure detection to alert on tool skipping, token spikes, and quality degradation
Advanced LLM observability using Langfuse for tracing, cost tracking, and evaluation scoring
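The drift-detection item can be illustrated with a minimal, dependency-free sketch of its two statistics. It compares a reference sample (for example, one feature or model score from a validation window) against a production sample of the same quantity. The function names, the quantile binning, the `eps` floor for empty bins, and the 0.05 significance level are illustrative assumptions, not this project's implementation. The KS check uses the standard large-sample critical value rather than an exact p-value.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference `expected`.

    Bin edges are quantiles of the reference sample; empty bins are floored at
    `eps` so the log term stays finite.
    """
    ref = sorted(expected)
    n = len(ref)
    # bins - 1 interior edges taken at the reference quantiles.
    edges = [ref[(n * i) // bins] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect_right(edges, x)] += 1
        return [max(c / len(sample), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed points, so checking those is enough.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap


def ks_drift(reference, production, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(production)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)  # about 1.358 for alpha = 0.05
    return ks_statistic(reference, production) > c_alpha * math.sqrt((n + m) / (n * m))


# Usage: a production sample shifted by half the reference range.
reference = list(range(1000))
production = [x + 500 for x in reference]
print(round(psi(reference, production), 2), ks_drift(reference, production))
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift. These cut-offs are starting points to tune per feature. `scipy.stats.ks_2samp` computes the same KS statistic together with a p-value if SciPy is available.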
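One way to express standardized alerting rules with severity levels is a small table of thresholds evaluated against current metric values. In this hypothetical sketch, several rules can watch the same metric at different thresholds, and only the most severe firing rule per metric is reported, so a single problem does not raise two alerts. The severity names, rule names, metric names, and thresholds are all illustrative assumptions. In a Prometheus setup, the same tiers are commonly expressed as alerting rules that carry a `severity` label.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Higher value means more urgent; the level names are illustrative.
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float  # the rule fires when the metric is strictly above this
    severity: Severity


def evaluate(rules, metrics):
    """Return (rule name, severity) pairs, keeping only the most severe firing
    rule per metric and listing the most urgent alerts first."""
    worst = {}
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None or value <= rule.threshold:
            continue  # metric missing or within bounds
        current = worst.get(rule.metric)
        if current is None or rule.severity > current.severity:
            worst[rule.metric] = rule
    fired = sorted(worst.values(), key=lambda r: r.severity, reverse=True)
    return [(r.name, r.severity) for r in fired]


# Usage with hypothetical rules: two drift tiers plus error-rate and latency rules.
RULES = [
    AlertRule("drift_moderate", "psi", 0.10, Severity.WARNING),
    AlertRule("drift_severe", "psi", 0.25, Severity.CRITICAL),
    AlertRule("high_error_rate", "error_rate", 0.05, Severity.CRITICAL),
    AlertRule("slow_responses", "p95_latency_s", 2.0, Severity.WARNING),
]
print(evaluate(RULES, {"psi": 0.30, "error_rate": 0.01, "p95_latency_s": 2.5}))
```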
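Silent failures raise no exception, so they have to be inferred from the shape of each request. The hypothetical detector below covers two of the listed signals. A token spike is flagged when a request's token count is more than `z_threshold` population standard deviations above a rolling baseline. Tool skipping is flagged when a tool the caller expected was never invoked. The class name, window size, and thresholds are assumptions. Detecting quality degradation would also require an evaluation score, which this sketch leaves out.

```python
import statistics
from collections import deque


class SilentFailureDetector:
    """Hypothetical per-request checks for token spikes and skipped tools."""

    def __init__(self, window=50, z_threshold=3.0, min_history=10):
        self.history = deque(maxlen=window)  # rolling baseline of token counts
        self.z_threshold = z_threshold
        self.min_history = min_history  # no spike checks until the baseline is warm

    def check(self, total_tokens, expected_tools, called_tools):
        issues = []
        if len(self.history) >= self.min_history:
            baseline = list(self.history)
            mean = statistics.fmean(baseline)
            stdev = statistics.pstdev(baseline)
            # A perfectly flat baseline (stdev == 0) skips the z-score test.
            if stdev > 0 and (total_tokens - mean) / stdev > self.z_threshold:
                issues.append(f"token_spike: {total_tokens} tokens (baseline mean {mean:.0f})")
        skipped = sorted(set(expected_tools) - set(called_tools))
        if skipped:
            issues.append("tool_skipped: " + ", ".join(skipped))
        self.history.append(total_tokens)
        return issues


# Usage: warm up on normal traffic, then feed a spike and a skipped tool.
detector = SilentFailureDetector()
for tokens in [100, 110, 90, 105, 95, 100, 110, 90, 105, 95]:
    detector.check(tokens, ["search"], ["search"])
print(detector.check(1000, ["search"], ["search"]))
print(detector.check(100, ["search", "calculator"], ["search"]))
```

Every request, including a spike, is added to the baseline. As a result, a sustained level shift stops alerting once the window absorbs it. Excluding flagged values from the baseline would instead keep the alert firing until someone intervenes.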