01 Automated metrics integration, including BLEU, ROUGE, BERTScore, and Perplexity
02 RAG-specific evaluation metrics such as MRR, NDCG, and Groundedness
03 LLM-as-judge implementation for pointwise and pairwise model comparisons
04 Automated regression detection to identify performance degradation between versions
05 Statistical A/B testing framework with Cohen’s d effect size calculation
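
To make a few of the listed metrics concrete, here is a minimal sketch of MRR, NDCG@k, and Cohen's d using only the Python standard library. The function names, argument shapes, and the pooled-standard-deviation form of Cohen's d are assumptions chosen for illustration. They are not taken from this tool's actual API, which the list above does not describe.

```python
import math
from statistics import mean, variance


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries.

    ranked_relevance: one list per query of 0/1 relevance flags in rank
    order (hypothetical input shape). A query with no relevant result
    contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query, with linear gains and a log2 rank discount.

    gains: graded relevance scores in the order the system returned them.
    """
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Each sample needs at least two values. Raises ValueError when both
    samples have zero variance, since d is undefined in that case.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; Cohen's d is undefined")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

In an A/B comparison, `a` and `b` would be per-example scores from the two model versions. A common rule of thumb reads |d| near 0.2 as a small effect, 0.5 as medium, and 0.8 as large, though suitable thresholds depend on the evaluation set.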