01 Complexity stratification to test simple vs. deep reasoning tasks
02 Multi-dimensional rubrics for scoring accuracy, completeness, and efficiency
03 Context engineering evaluation to identify performance degradation cliffs
04 Continuous evaluation pipeline templates for regression tracking
05 LLM-as-judge implementation patterns for automated quality assessment
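
As a rough illustration of how the first two items might fit together, the sketch below scores each task on the three rubric dimensions named above and then averages the scores per complexity stratum. It is a minimal sketch under stated assumptions, not the project's actual implementation: the weights, the `Result` type, and the function names `rubric_score` and `stratified_means` are all hypothetical and do not come from the source.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights: the source names the three dimensions
# but does not say how they are weighted.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "efficiency": 0.2}


@dataclass
class Result:
    """One evaluated task (hypothetical structure for illustration)."""
    task_id: str
    complexity: str  # stratum label, e.g. "simple" or "deep"
    scores: dict     # dimension name -> score in [0, 1]


def rubric_score(scores, weights=WEIGHTS):
    """Weighted mean of the per-dimension scores."""
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total


def stratified_means(results):
    """Mean rubric score per complexity stratum."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r.complexity].append(rubric_score(r.scores))
    return {level: sum(vals) / len(vals) for level, vals in buckets.items()}


results = [
    Result("t1", "simple", {"accuracy": 1.0, "completeness": 1.0, "efficiency": 1.0}),
    Result("t2", "deep", {"accuracy": 0.5, "completeness": 0.5, "efficiency": 0.5}),
]
print(stratified_means(results))
```

Reporting one mean per stratum, rather than a single overall number, keeps a drop on deep reasoning tasks from being masked by strong results on simple ones.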