01 RAG-specific evaluation for retrieval quality and groundedness
02 1 GitHub star
03 Automated metric implementations, including BLEU, ROUGE, and BERTScore
04 Automated regression detection to prevent performance degradation
05 LLM-as-judge patterns for pointwise and pairwise model comparisons
06 Statistical A/B testing framework with p-value and Cohen's d analysis
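
As a rough illustration of the A/B testing item, the sketch below compares per-example scores from two model variants and reports an effect size and a p-value. It is not taken from the project: the function names `cohens_d` and `permutation_p_value` and the sample scores are hypothetical, and the permutation test is an assumed choice of significance test, since the list does not say which one is used.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples (b minus a), using the pooled SD."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in means.

    Assumed technique: the source only mentions "p-value", not a specific test.
    """
    rng = random.Random(seed)
    na = len(a)
    observed = abs(sum(b) / len(b) - sum(a) / na)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:na], pooled[na:]
        diff = abs(sum(right) / len(right) - sum(left) / na)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example quality scores for a baseline and a candidate model.
baseline = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59, 0.61, 0.64]
candidate = [0.70, 0.68, 0.73, 0.69, 0.71, 0.67, 0.72, 0.74]

effect = cohens_d(baseline, candidate)
p_value = permutation_p_value(baseline, candidate)
print(f"Cohen's d = {effect:.2f}, p = {p_value:.4f}")
```

Reporting the effect size next to the p-value helps separate a difference that is merely detectable from one that is large enough to matter in practice.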