Introduction
The eval-design skill provides a systematic framework for moving beyond "vibes-based" testing to empirical, production-grade measurement of AI quality. It guides users through a structured process: identifying real-world failure modes, selecting the appropriate evaluation type (from cost-effective code-based checks to more sophisticated LLM-as-judge patterns), and generating precise technical specifications. By focusing on binary pass/fail criteria and calibration, it helps teams build robust golden datasets and automated metrics that are ready to implement in observability platforms such as Langfuse.
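To make the distinction concrete, a code-based check in the sense used here can be as small as a deterministic schema validation that yields a binary pass/fail verdict for each golden-dataset case. The sketch below is illustrative only: the names (`GOLDEN_SET`, `REQUIRED_KEYS`, `passes_schema_check`) and the JSON-shaped output contract are assumptions for this example, not something the skill prescribes.

```python
# Minimal sketch of a binary, code-based check (hypothetical names and data).
import json

GOLDEN_SET = [
    # Each case pairs an input with a model output captured from production.
    {"input": "Summarize order #123", "output": '{"summary": "Shipped", "order_id": 123}'},
    {"input": "Summarize order #456", "output": "Sorry, I can't help with that."},
]

# The output contract assumed for this example: valid JSON with these keys.
REQUIRED_KEYS = {"summary", "order_id"}


def passes_schema_check(output: str) -> bool:
    """Binary pass/fail: output must be valid JSON containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()


if __name__ == "__main__":
    for case in GOLDEN_SET:
        verdict = "PASS" if passes_schema_check(case["output"]) else "FAIL"
        print(f"{verdict}: {case['input']}")
```

Because the verdict is strictly binary, results aggregate into a simple pass rate that can be tracked over time; subtler qualities (tone, faithfulness) that resist this kind of deterministic check are where the LLM-as-judge patterns and calibration steps of the skill come in.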