01RAG-specific evaluation metrics for retrieval and groundedness
02Automated regression detection to track performance across versions
03LLM-as-Judge patterns for automated pointwise and pairwise grading
04Statistical A/B testing and Cohen's d effect size analysis
05Automated NLP metrics including BLEU, ROUGE, and BERTScore
062 GitHub stars