01 LLM-as-Judge patterns for automated qualitative assessment
02 RAG-specific evaluation for retrieval (MRR, NDCG) and groundedness
03 Statistical A/B testing framework with Cohen's d effect size
04 Automated regression detection to prevent quality drops during deployment
05 Automated NLP metrics including BLEU, ROUGE, and BERTScore
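
As a rough illustration of the retrieval metrics and the effect size named in the list, here is a minimal standard-library sketch. It is not this project's implementation: the function names (`mean_reciprocal_rank`, `ndcg_at_k`, `cohens_d`) and input shapes are assumptions made for the example. NDCG uses the linear-gain form (some tools use 2^rel - 1 instead), and Cohen's d uses the pooled sample standard deviation.

```python
import math
from statistics import mean, variance


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries. Each entry is a list of 0/1 relevance flags
    in ranked order (hypothetical input shape). Assumes at least one query."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(gains, k):
    """NDCG@k: DCG of the top k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def cohens_d(baseline, candidate):
    """Cohen's d with pooled sample standard deviation.
    Each group needs at least two scores and some variance overall."""
    n_a, n_b = len(baseline), len(candidate)
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(candidate)
    ) / (n_a + n_b - 2)
    return (mean(candidate) - mean(baseline)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Three queries: first relevant hit at rank 2, rank 1, and none at all.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
    # Graded relevance of one ranked result list.
    print(ndcg_at_k([1, 2, 3], k=3))
    # Quality scores from a baseline and a candidate variant.
    print(cohens_d([1, 2, 3], [2, 3, 4]))
```

A positive d here means the candidate scored higher than the baseline. Any threshold for calling a difference a regression is left to the caller.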