01. LLM-as-judge implementation for scalable, automated performance grading
02. Complexity-stratified test set generation for simple to research-level tasks
03. Token budget and tool-call optimization analysis based on performance research
04. Multi-dimensional rubric design for accuracy, completeness, and tool efficiency
05. Context engineering validation to measure the impact of prompts and history
06. 10 GitHub stars
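
As a rough illustration of how the multi-dimensional rubric and complexity stratification listed above might fit together, the sketch below combines per-dimension judge scores into one weighted score and averages results per complexity tier. The weights, the `[0, 1]` score range, the tier names, and the function names `weighted_score` and `mean_by_tier` are all assumptions made for this example; the source names the rubric dimensions but does not specify how they are scored or combined.

```python
from collections import defaultdict

# Hypothetical weights: the source lists these three dimensions
# but does not say how much each one counts.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    # Normalize by the weight total so the result stays in [0, 1]
    # even if the weights do not sum to exactly 1.
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def mean_by_tier(results):
    """Average weighted scores per complexity tier.

    `results` is an iterable of (tier, scores) pairs, where `tier` is a
    label such as "simple" or "research" (hypothetical names).
    """
    buckets = defaultdict(list)
    for tier, scores in results:
        buckets[tier].append(weighted_score(scores))
    return {tier: sum(vals) / len(vals) for tier, vals in buckets.items()}


if __name__ == "__main__":
    # Made-up judge outputs, for demonstration only.
    demo = [
        ("simple", {"accuracy": 1.0, "completeness": 1.0, "tool_efficiency": 1.0}),
        ("research", {"accuracy": 1.0, "completeness": 0.5, "tool_efficiency": 0.0}),
    ]
    print(mean_by_tier(demo))
```

Reporting one mean per tier, rather than a single global average, keeps a weak result on research-level tasks from being hidden by strong results on simple ones.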