01. RAG-specific evaluation for retrieval (MRR, NDCG) and groundedness
02. Statistical A/B testing framework with Cohen's d effect size analysis
03. LLM-as-judge patterns for pointwise and pairwise model comparisons
04. 3 GitHub stars
05. Automated text metrics including BLEU, ROUGE, and BERTScore
06. Automated regression detection to prevent performance drops during updates
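
As a rough illustration of the retrieval metrics and the effect size named in the list above, the following is a minimal sketch using only the Python standard library. The function names (`mrr`, `dcg`, `ndcg`, `cohens_d`) and the input shapes are assumptions made for this example, not the project's actual API. NDCG here uses linear gains and takes the ideal ordering from the retrieved list itself, which is one common convention among several.

```python
import math
from statistics import mean, variance


def mrr(ranked_relevance):
    """Mean reciprocal rank. Each query is a list of 0/1 flags in rank order."""
    total = 0.0
    for flags in ranked_relevance:
        # Reciprocal rank of the first relevant hit, or 0 if nothing relevant was retrieved.
        total += next((1.0 / (i + 1) for i, rel in enumerate(flags) if rel), 0.0)
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """NDCG@k. The ideal ordering is the retrieved gains sorted in descending order."""
    top = gains[:k] if k is not None else gains
    ideal = sorted(gains, reverse=True)[: len(top)]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation.

    Assumes each group has at least two scores and the pooled variance is non-zero.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

For example, `mrr([[0, 1, 0], [1, 0, 0]])` averages the reciprocal ranks 1/2 and 1, and `cohens_d` could be applied to per-example scores from two model variants in an A/B comparison to judge whether a difference is practically meaningful rather than only statistically significant.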