01 LLM-as-judge evaluation patterns for pointwise and pairwise scoring
02 Automated regression detection to prevent performance drops in CI/CD
03 Automated metric computation, including BLEU, ROUGE, and BERTScore
04 RAG-specific metrics for retrieval quality and groundedness
05 Statistical A/B testing framework with Cohen's d effect size analysis
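
As a rough illustration of how the effect-size analysis and the regression check could fit together, the sketch below computes Cohen's d with a pooled standard deviation and flags a candidate run whose scores fall well below a baseline. The function names `cohens_d` and `is_regression` and the 0.5 cutoff are assumptions made for this example; they are not taken from the tool itself.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Positive values mean the candidate scored higher than the baseline.
    """
    n1, n2 = len(baseline), len(candidate)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two scores")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(candidate)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples have zero variance")
    return (mean(candidate) - mean(baseline)) / sqrt(pooled_var)


def is_regression(baseline, candidate, min_effect=0.5):
    """Hypothetical CI gate: True when the candidate is worse by at least min_effect.

    The 0.5 default follows the common 'medium effect' convention and is an
    assumption of this sketch, not a setting documented by the tool.
    """
    return cohens_d(baseline, candidate) <= -min_effect


if __name__ == "__main__":
    # Illustrative scores only, not real evaluation results.
    baseline_scores = [0.80, 0.82, 0.81, 0.79, 0.83]
    candidate_scores = [0.70, 0.72, 0.71, 0.69, 0.73]
    print("Cohen's d:", round(cohens_d(baseline_scores, candidate_scores), 3))
    print("regression:", is_regression(baseline_scores, candidate_scores))
```

A CI job might call `is_regression` on per-example scores from the previous and current model versions and fail the build when it returns True. Effect size alone says nothing about significance, so a real pipeline would likely pair it with a hypothesis test.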