Automated regression detection against established performance baselines
Human evaluation frameworks with inter-rater agreement calculation
Statistical A/B testing suite with t-tests and effect size analysis
Scalable LLM-as-judge patterns for pointwise and pairwise evaluation
Automated text metrics including BLEU, ROUGE, and BERTScore
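
As an illustration of the inter-rater agreement calculation mentioned in the list above, the following is a minimal sketch of Cohen's kappa for two raters using only the Python standard library. The function name `cohens_kappa` and the example labels are assumptions made for this illustration; the list does not say which agreement statistic or API the project actually uses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Hypothetical helper for illustration; not taken from the project.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0

    return (observed - expected) / (1 - expected)


# Example labels are made up for this sketch.
ratings_a = ["good", "good", "bad", "bad"]
ratings_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(ratings_a, ratings_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 means the raters agree no more often than chance would predict, and a value of 1 means they agree on every item.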