01 LLM-as-Judge patterns for qualitative and pairwise assessments
02 RAG-specific evaluation metrics like MRR, NDCG, and groundedness
03 Statistical A/B testing framework with significance and effect size
04 Automated NLP metrics including BLEU, ROUGE, and BERTScore
05 Regression detection to prevent performance degradation
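
The two ranking metrics named in the list, MRR and NDCG, can be illustrated with a minimal standard-library sketch. The function names (`mrr`, `dcg`, `ndcg`) and the input shapes are assumptions made for illustration and are not taken from the project itself. The sketch uses the linear-gain form of DCG; some implementations use an exponential gain (2^rel - 1) instead.

```python
import math


def mrr(ranked_relevance):
    """Mean reciprocal rank.

    ranked_relevance: one list per query, holding 0/1 relevance flags
    in the order the retriever returned the documents.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG of the returned order divided by the DCG of the ideal order."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `mrr([[0, 1, 0], [1, 0, 0]])` averages reciprocal ranks 1/2 and 1, and `ndcg([3, 2, 1])` scores a ranking that is already in ideal order.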