01 RAG-specific retrieval metrics like NDCG and Mean Reciprocal Rank (MRR)
02 LLM-as-Judge patterns for semantic and qualitative scoring
03 Automated regression detection to prevent performance drops during updates
04 Automated NLP metrics including BLEU, ROUGE, and BERTScore
05 Statistical A/B testing framework with Cohen's d effect size analysis
06 6 GitHub stars
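
Items 01 and 05 name metrics with standard textbook definitions, so a small example may help make them concrete. The Python below is a minimal sketch of those definitions, not this project's implementation. The function names (`reciprocal_rank`, `mean_reciprocal_rank`, `dcg`, `ndcg`, `cohens_d`) are hypothetical, and the NDCG here assumes every relevant document already appears in the ranked list, so the ideal ordering is just the same gains sorted in descending order.

```python
import math
from statistics import mean, stdev


def reciprocal_rank(relevance):
    """Return 1/rank of the first relevant hit, or 0.0 if there is none.

    `relevance` is a list of 0/1 flags in ranked order (best result first).
    """
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over a list of per-query relevance lists."""
    return mean(reciprocal_rank(r) for r in queries)


def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built only from the gains supplied.
    """
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Two queries: first relevant hit at rank 1 and at rank 2.
    print(mean_reciprocal_rank([[1, 0, 0], [0, 1, 0]]))
    # Graded relevance for one ranked list, scored at k=3.
    print(ndcg([2, 3, 1], k=3))
    # Per-example scores from two hypothetical system variants.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In an A/B comparison, `cohens_d` would typically be reported alongside a significance test, since the effect size says how large a difference is while the test says how likely it is to be noise.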