- Composite evaluator patterns for combining hard constraints with LLM scoring
- Support for deterministic bash-based scoring scaffolds
- Automated generation of Python-based LLM judge evaluators
- Dataset-aware template generation for ground-truth comparisons
- Customizable diagnostic feedback for iterative artifact reflection
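The composite-evaluator pattern above can be sketched as follows. This is a minimal illustration, not the project's actual API: the names (`EvalResult`, `composite_evaluate`) and the stand-in constraint and judge functions are hypothetical. The idea is to gate on cheap deterministic checks first and only invoke the (expensive, nondeterministic) LLM judge when every hard constraint passes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalResult:
    passed: bool   # did every hard constraint hold?
    score: float   # LLM-judge score in [0, 1]; 0.0 when gated out


def composite_evaluate(
    artifact: str,
    constraints: List[Callable[[str], bool]],
    llm_judge: Callable[[str], float],
) -> EvalResult:
    """Run deterministic constraints first; call the LLM judge only
    when all of them pass."""
    if not all(check(artifact) for check in constraints):
        return EvalResult(passed=False, score=0.0)
    return EvalResult(passed=True, score=llm_judge(artifact))


# Hypothetical stand-ins for illustration only:
def non_empty(text: str) -> bool:
    return bool(text.strip())


def under_100_words(text: str) -> bool:
    return len(text.split()) <= 100


def stub_judge(text: str) -> float:
    # Placeholder for a real LLM call returning a quality score.
    return 0.8


result = composite_evaluate(
    "A short answer.", [non_empty, under_100_words], stub_judge
)
print(result.passed, result.score)  # True 0.8
```

Gating before judging keeps scores deterministic for clearly invalid artifacts and avoids spending LLM calls on them.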