01 Statistical A/B testing framework with Cohen's d effect size
02 Specialized RAG metrics for retrieval quality and groundedness
03 LLM-as-judge patterns for pointwise and pairwise semantic evaluation
04 Automated regression detection to prevent performance degradation
05 Automated text metrics including BLEU, ROUGE, and BERTScore
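
As a rough illustration of the first and fourth items, the sketch below computes Cohen's d between two sets of evaluation scores and uses it as a simple regression flag. It is a minimal sketch, not the framework's actual API: the function names `cohens_d` and `is_regression` and the 0.5 threshold (Cohen's conventional "medium" effect) are assumptions chosen for the example.

```python
import math
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d of candidate vs. baseline, using the pooled sample standard deviation."""
    n_a, n_b = len(baseline), len(candidate)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(candidate)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


def is_regression(baseline, candidate, threshold=0.5):
    """Flag a regression when the candidate scores lower by at least `threshold` (hypothetical cutoff)."""
    return cohens_d(baseline, candidate) <= -threshold
```

A positive d means the candidate scored higher than the baseline. In practice an effect size would usually be paired with a significance test, since d alone says nothing about sampling noise.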