01 Statistical A/B testing framework with Cohen’s d effect size analysis
02 0 GitHub stars
03 Automated regression detection to identify performance drops before deployment
04 Implementation of automated metrics including BLEU, ROUGE, and BERTScore
05 LLM-as-judge patterns for pointwise and pairwise model comparison
06 Retrieval-specific metrics for RAG systems like MRR, NDCG, and groundedness
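
To make two of the items above concrete, here is a minimal sketch of Cohen's d, MRR, and NDCG using their standard textbook definitions. It is not taken from the framework itself: the function names (`cohens_d`, `mean_reciprocal_rank`, `ndcg_at_k`) and the input shapes are assumptions chosen for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample std dev.

    Assumes each sample has at least two values and the pooled variance
    is non-zero (otherwise the division below fails).
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries.

    ranked_relevance: one list per query of 0/1 relevance flags in rank order
    (hypothetical input shape). A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query, with linear gain and a log2 position discount.

    gains: graded relevance scores in the order the system returned them.
    """
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` gives 0.5 (a mean difference of 1 over a pooled standard deviation of 2), and `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])` gives 0.5. Other NDCG variants use an exponential gain (`2**g - 1`), so results may differ from a given library depending on which convention it follows.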