01 Statistical A/B testing framework with p-value and effect size analysis
02 Automated NLP metrics including BLEU, ROUGE, and BERTScore
03 RAG-specific evaluation for retrieval precision and groundedness
04 Automated regression detection against established performance baselines
05 LLM-as-Judge patterns for pointwise and pairwise quality assessment
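
As a rough illustration of how the A/B testing and regression detection items in this list can fit together, here is a minimal sketch using only the Python standard library. It is not the framework's actual implementation. The function names (`cohens_d`, `permutation_p_value`, `is_regression`) and the thresholds `alpha` and `min_effect` are hypothetical, and the permutation test is just one of several ways to obtain a p-value.

```python
import random
import statistics


def cohens_d(a, b):
    """Effect size: difference in means (b minus a) over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the difference in means under random relabeling."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(statistics.mean(right) - statistics.mean(left))
        # Small tolerance guards against floating-point noise on ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


def is_regression(baseline_scores, candidate_scores, alpha=0.05, min_effect=0.2):
    """Flag a regression when the candidate is worse by a meaningful,
    statistically significant margin (thresholds are illustrative assumptions)."""
    d = cohens_d(baseline_scores, candidate_scores)
    p = permutation_p_value(baseline_scores, candidate_scores)
    return d <= -min_effect and p < alpha
```

A caller would pass per-example scores from the baseline run and the candidate run, for example `is_regression(baseline_scores, candidate_scores)`. Requiring both a minimum effect size and a significance threshold is a design choice that avoids flagging differences that are statistically detectable but too small to matter.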