LLM-as-Judge patterns for pointwise and pairwise evaluation
Regression detection to prevent performance drift in updates
Automated NLP metrics including BLEU, ROUGE, and BERTScore
Retrieval-specific metrics for RAG systems like MRR and NDCG
Statistical A/B testing framework with Cohen's d effect size
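
As a rough illustration of the retrieval and effect-size metrics named above, the sketch below computes MRR, NDCG@k, and Cohen's d using only the Python standard library. The function names (`mrr`, `ndcg_at_k`, `cohens_d`) and input shapes are assumptions chosen for this example, not the tool's actual API.

```python
import math
from statistics import mean, variance


def mrr(ranked_relevance):
    """Mean reciprocal rank.

    Assumed input: a non-empty list of queries, each a list of 0/1
    relevance flags in ranked order. A query with no relevant result
    contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query, using linear gain and a log2 rank discount.

    Assumed input: graded relevance scores in ranked order.
    """
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation.

    Assumes each group has at least two scores and the pooled
    variance is non-zero.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    print(mrr([[0, 1, 0], [1, 0, 0]]))
    print(ndcg_at_k([3, 2, 1], k=3))
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In an A/B comparison, `a` and `b` would hold per-example scores for the two variants; a common rule of thumb reads d near 0.2 as a small effect, 0.5 as medium, and 0.8 as large.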