RAG-specific metrics like MRR, NDCG, and Groundedness checks
Regression detection to prevent performance drops during updates
Automated metrics including BLEU, ROUGE, BERTScore, and Perplexity
LLM-as-Judge patterns for single output and pairwise comparison
Statistical A/B testing framework with Cohen's d effect size calculation
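
Three of the quantities named in the list above (MRR, NDCG, and Cohen's d) have standard textbook definitions, and a minimal sketch may make them concrete. The function names and input shapes here are hypothetical and chosen for illustration only; they are not taken from this tool's API. The NDCG variant below assumes the ideal ranking is computed from the retrieved list alone, and Cohen's d uses the pooled sample standard deviation.

```python
import math
from statistics import mean, variance


def mrr(ranked_relevance):
    """Mean Reciprocal Rank.

    ranked_relevance: one list per query of 0/1 relevance flags in rank order
    (hypothetical input shape). A query with no relevant hit contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """NDCG@k, normalizing by the best ordering of the same retrieved gains.

    Assumption: relevant documents that were never retrieved are ignored.
    """
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d effect size for two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

As a usage note, `cohens_d(scores_variant_a, scores_variant_b)` would be applied to per-example quality scores from two variants in an A/B test; by common convention, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, though those thresholds are rules of thumb rather than hard cutoffs.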