Introduction
This skill equips developers with standardized methodologies for assessing autonomous agent systems, addressing challenges such as non-determinism and variable execution paths. It supports building multi-dimensional rubrics that evaluate factors such as factual accuracy, tool efficiency, and citation quality, and accommodates both LLM-as-judge and human-in-the-loop workflows. By applying complexity stratification and token budget analysis, it keeps context engineering decisions data-driven and helps agent behaviors remain reliable across model upgrades and system changes.
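A multi-dimensional rubric of the kind described above can be modeled as a small data structure that aggregates per-dimension scores into a single weighted result. The sketch below is a hypothetical illustration (the dimension names, weights, and `Rubric` API are assumptions, not part of this skill's actual interface); scores would typically be filled in by an LLM judge or a human reviewer.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RubricDimension:
    """One evaluation axis, e.g. factual accuracy or tool efficiency."""
    name: str
    weight: float                   # relative importance in the aggregate
    score: Optional[float] = None   # 0.0-1.0, assigned by an LLM or human judge

@dataclass
class Rubric:
    dimensions: list = field(default_factory=list)

    def aggregate(self) -> float:
        """Weighted average over all dimensions that have been scored."""
        scored = [d for d in self.dimensions if d.score is not None]
        total_weight = sum(d.weight for d in scored)
        if total_weight == 0:
            raise ValueError("no scored dimensions")
        return sum(d.weight * d.score for d in scored) / total_weight

# Hypothetical example: weights and scores are illustrative only.
rubric = Rubric([
    RubricDimension("factual_accuracy", weight=0.5, score=0.9),
    RubricDimension("tool_efficiency", weight=0.3, score=0.7),
    RubricDimension("citation_quality", weight=0.2, score=1.0),
])
print(round(rubric.aggregate(), 2))  # 0.5*0.9 + 0.3*0.7 + 0.2*1.0 = 0.86
```

Normalizing by the total weight of *scored* dimensions lets partially completed evaluations (e.g. a human-in-the-loop pass that only graded some axes) still yield a comparable aggregate.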