01 RAG evaluation metrics for retrieval systems, such as MRR and NDCG
02 LLM-as-Judge patterns for automated pointwise and pairwise comparisons
03 15,684 GitHub stars
04 Automated text generation metrics, including BLEU, ROUGE, and BERTScore
05 A/B testing framework with statistical significance and effect size analysis
06 Regression detection to identify performance drops before deployment
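
To make the retrieval metrics named in the list concrete, here is a minimal sketch of MRR and NDCG in plain Python. The function names (`mrr`, `dcg`, `ndcg`) and the input layout (relevance labels listed in ranked order) are assumptions for illustration, not this project's API. The NDCG shown uses the linear-gain variant and computes the ideal ranking from the retrieved list only, which assumes every relevant document appears somewhere in that list.

```python
import math
from typing import Sequence


def mrr(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Mean reciprocal rank over queries.

    Each inner sequence holds relevance labels (0 = not relevant) in the
    order the retriever returned the documents. Hypothetical input layout.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(rels: Sequence[float], k: int) -> float:
    """Discounted cumulative gain at cutoff k (linear gain variant)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))


def ndcg(rels: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of `rels`.

    Assumption: the ideal ordering is built from the retrieved list itself.
    """
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Two queries: first relevant hit at rank 2 and at rank 1.
    print(mrr([[0, 1, 0], [1, 0, 0]]))
    # Graded relevance, already in ideal order.
    print(ndcg([3, 2, 1], k=3))
```

In practice a maintained implementation, such as a library's NDCG function, is usually preferable to hand-rolled code; the sketch is only meant to show what the two metrics measure.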