Introduction
The Agent Quality & Evaluation skill empowers developers to build robust observability into AI agents by implementing structured evaluation metrics and feedback loops. It provides standardized patterns for LLM-as-judge scoring, human feedback collection, and ground truth comparisons across dimensions like correctness, helpfulness, and safety. By linking evaluations directly to execution traces, it enables granular debugging of failures and ensures that agent performance is quantitatively measured and continuously improved from development through to production.
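The LLM-as-judge pattern mentioned above can be illustrated with a minimal sketch. The names here (`EvalResult`, `judge`, `JUDGE_PROMPT`, and the generic `call_judge` callable) are illustrative assumptions rather than this skill's actual API; the key idea is that each score carries the `trace_id` of the execution it evaluates, so a low score can be traced back to the exact run that produced it:

```python
# Hypothetical sketch of LLM-as-judge scoring tied to an execution trace.
# `call_judge` stands in for any LLM client call; it is not a real library API.
import json
from dataclasses import dataclass
from typing import Callable

JUDGE_PROMPT = """You are an evaluation judge. Score the agent's answer
on the dimension "{dimension}" from 1 (poor) to 5 (excellent).
Question: {question}
Agent answer: {answer}
Respond with JSON: {{"score": <int>, "rationale": "<short reason>"}}"""

@dataclass
class EvalResult:
    trace_id: str    # links the score back to the execution trace
    dimension: str   # e.g. "correctness", "helpfulness", "safety"
    score: int       # 1-5 rubric score assigned by the judge model
    rationale: str   # judge's short explanation, useful when debugging failures

def judge(call_judge: Callable[[str], str], trace_id: str,
          question: str, answer: str, dimension: str) -> EvalResult:
    """Ask a judge model to score one agent answer on one dimension."""
    prompt = JUDGE_PROMPT.format(dimension=dimension,
                                 question=question, answer=answer)
    raw = call_judge(prompt)    # plug in any LLM client here
    parsed = json.loads(raw)    # the prompt instructs the judge to emit JSON
    return EvalResult(trace_id, dimension,
                      int(parsed["score"]), parsed["rationale"])
```

The same `EvalResult` shape can also hold human feedback or ground-truth comparison scores, which keeps all three evaluation sources queryable against the same trace store.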