Human evaluation workflows with inter-rater agreement (Cohen's Kappa) tracking
Automated metrics integration including BLEU, ROUGE, and BERTScore
LLM-as-judge patterns for pointwise and pairwise qualitative assessment
Regression detection to prevent performance drops during model or prompt updates
Statistical A/B testing framework with Cohen's d effect size analysis
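
The two statistics named in this list, Cohen's Kappa for inter-rater agreement and Cohen's d for A/B effect size, can be sketched with the Python standard library alone. This is a minimal illustration of the textbook formulas, not this project's implementation: the function names `cohen_kappa` and `cohens_d`, the label values, and the sample scores are all hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Illustrative data only: two annotators and two prompt variants.
kappa = cohen_kappa(["good", "good", "bad", "bad"], ["good", "bad", "bad", "bad"])
effect = cohens_d([2, 4, 6], [1, 3, 5])
```

A Kappa near 0 means agreement is no better than chance and a value near 1 means near-perfect agreement. Cohen's d expresses the difference between two variants in units of their pooled standard deviation, which makes it comparable across metrics with different scales.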