Statistical A/B testing framework with Cohen's d effect size analysis
RAG-specific evaluation metrics such as MRR, NDCG, and groundedness checks
Automated NLP metrics including BLEU, ROUGE, and BERTScore
LLM-as-Judge patterns for qualitative and pairwise scoring
Automated regression detection to prevent performance degradation
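
As a rough illustration of three of the metrics named above, the sketch below computes Cohen's d (using a pooled sample standard deviation), mean reciprocal rank, and NDCG@k in plain Python. The function names, input shapes, and the choice of linear gain in the DCG are assumptions made for this example; they are not taken from the framework itself, which may define these metrics differently.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries.

    ranked_relevance holds one list per query of 0/1 relevance flags in
    rank order (hypothetical input shape). A query with no relevant
    result contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k with linear gain and a log2 position discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(cohens_d([2, 4, 6], [1, 3, 5]))
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    print(ndcg_at_k([0, 1], k=2))
```

A common rule of thumb reads a Cohen's d near 0.2 as a small effect, 0.5 as medium, and 0.8 as large, which can help decide whether a statistically significant A/B difference is practically meaningful.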