Statistical A/B testing with significance and effect size analysis
Retrieval-Augmented Generation (RAG) metrics like MRR and NDCG
LLM-as-Judge patterns for pointwise and pairwise evaluation
Regression detection to prevent performance drops during deployment
Automated NLP metrics (BLEU, ROUGE, BERTScore, METEOR)
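
As an illustration of the retrieval metrics named in the list, here is a minimal sketch of MRR and NDCG in plain Python. The function names and the input shape (one list of relevance labels per query, in ranked order) are assumptions made for this example and are not taken from any particular library. The NDCG variant shown uses linear gain and computes the ideal ordering from the retrieved labels only, which is one common convention among several.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Average over queries of 1/rank of the first relevant result (0 if none)."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for two queries, in ranked order.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    print(ndcg_at_k([0, 1], k=2))
```

In practice a maintained implementation (for example an evaluation library's NDCG routine) is usually preferable; the sketch is only meant to make the definitions concrete.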