LLM-as-Judge patterns for pointwise and pairwise qualitative assessment
Automated regression detection to prevent performance drift in CI/CD
Automated NLP metrics including BLEU, ROUGE, and BERTScore
Statistical A/B testing framework with Cohen's d effect size analysis
RAG-specific evaluation for groundedness, relevance, and retrieval quality
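
To illustrate the Cohen's d effect size mentioned in the list, here is a minimal sketch that assumes two independent samples of per-example scores and the pooled-standard-deviation form of the statistic. The function name `cohens_d` and the sample data are hypothetical and are not taken from the framework itself.

```python
import math
from statistics import mean, variance


def cohens_d(baseline, candidate):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Hypothetical helper for illustration; not the framework's own API.
    """
    n_a, n_b = len(baseline), len(candidate)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(candidate)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")

    # Positive values mean the candidate scores higher than the baseline.
    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Made-up scores for two prompt variants, purely for demonstration.
    scores_a = [1, 2, 3, 4, 5]
    scores_b = [2, 3, 4, 5, 6]
    print(f"Cohen's d: {cohens_d(scores_a, scores_b):.3f}")
```

By the common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects. Those thresholds are conventions, not settings confirmed by this framework.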