01 Statistical A/B testing framework with p-value and effect size calculations
02 Automated metrics including BLEU, ROUGE, and BERTScore for text similarity
03 LLM-as-Judge patterns for automated pointwise and pairwise evaluations
04 Comprehensive RAG metrics for measuring retrieval (MRR, NDCG) and groundedness
05 Regression detection to identify performance drops during the development cycle
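
As a rough illustration of how items 01 and 05 could fit together, the sketch below compares per-example scores from a baseline and a candidate run, computes an effect size (Cohen's d) and a p-value, and flags a regression when the drop is both large and statistically significant. It is not taken from the framework itself: the function names (`cohens_d`, `permutation_p_value`, `is_regression`), the choice of a permutation test, and the thresholds `alpha=0.05` and `min_effect=0.2` are all assumptions made for this example.

```python
import random
import statistics


def _mean(xs):
    return sum(xs) / len(xs)


def cohens_d(baseline, candidate):
    """Effect size: (candidate mean - baseline mean) / pooled standard deviation."""
    na, nb = len(baseline), len(candidate)
    var_a = statistics.variance(baseline)
    var_b = statistics.variance(candidate)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    if pooled_sd == 0:
        return 0.0
    return (_mean(candidate) - _mean(baseline)) / pooled_sd


def permutation_p_value(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means (hypothetical helper)."""
    rng = random.Random(seed)
    observed = abs(_mean(candidate) - _mean(baseline))
    pooled = list(baseline) + list(candidate)
    split = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[split:]) - _mean(pooled[:split]))
        # Small tolerance guards against floating-point summation-order noise.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def is_regression(baseline, candidate, alpha=0.05, min_effect=0.2):
    """Flag a regression when the candidate is meaningfully and significantly worse.

    alpha and min_effect are illustrative defaults, not values from the source.
    """
    d = cohens_d(baseline, candidate)
    p = permutation_p_value(baseline, candidate)
    return d <= -min_effect and p < alpha


if __name__ == "__main__":
    baseline = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.82, 0.81]
    candidate = [0.70, 0.72, 0.69, 0.71, 0.73, 0.70, 0.72, 0.71]
    print("effect size:", cohens_d(baseline, candidate))
    print("p-value:", permutation_p_value(baseline, candidate))
    print("regression:", is_regression(baseline, candidate))
```

Requiring both a minimum effect size and a significance threshold is one way to avoid flagging tiny but statistically detectable differences. A permutation test is used here only because it needs no third-party dependencies; a t-test or bootstrap would be a reasonable substitute.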