About
This skill equips Claude with strategies for monitoring and evaluating AI agents in production without adding request latency or inflating operational costs. It covers sampling techniques ranging from random and stratified to error-biased sampling, paired with non-blocking asynchronous evaluation queues. Using LLM-as-judge scoring and baseline comparisons, it helps developers detect quality regressions, stay within evaluation budgets, and maintain consistent agent behavior across diverse production traffic.
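The combination of error-biased sampling and a non-blocking evaluation queue can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `ProductionSampler` class, the trace dict with an `error` field, and the caller-supplied `evaluate` callback are all hypothetical names chosen for the example.

```python
import queue
import random
import threading

class ProductionSampler:
    """Decide which agent traces to evaluate, off the request hot path.

    Hypothetical sketch: a random base rate for ordinary traffic, a higher
    rate for errored traces (error-biased sampling), and a background
    worker thread so evaluation never blocks the live request.
    """

    def __init__(self, base_rate=0.05, error_rate=0.5, evaluate=None):
        self.base_rate = base_rate    # fraction of ordinary traffic sampled
        self.error_rate = error_rate  # higher fraction for failed traces
        self.evaluate = evaluate or (lambda trace: None)
        self._queue = queue.Queue(maxsize=1000)
        threading.Thread(target=self._drain, daemon=True).start()

    def should_sample(self, trace):
        # Errored traces carry the most regression signal, so they are
        # sampled at a higher rate than ordinary traffic.
        rate = self.error_rate if trace.get("error") else self.base_rate
        return random.random() < rate

    def submit(self, trace):
        """Non-blocking: shed load if the queue is full rather than stall."""
        if not self.should_sample(trace):
            return False
        try:
            self._queue.put_nowait(trace)
            return True
        except queue.Full:
            return False  # evaluation budget exhausted; drop silently

    def _drain(self):
        # Background worker: e.g. an LLM-as-judge call would happen here,
        # entirely outside the request path.
        while True:
            trace = self._queue.get()
            self.evaluate(trace)
            self._queue.task_done()
```

The bounded queue doubles as a crude budget control: when evaluation falls behind, excess traces are dropped instead of backing up into the serving path. Stratified sampling would follow the same shape, with per-stratum rates keyed on a trace attribute instead of the error flag.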