- RAG-specific evaluation for retrieval quality and groundedness
- Regression detection to identify performance drops before deployment
- LLM-as-Judge patterns for automated pointwise and pairwise scoring
- Automated NLP metrics including BLEU, ROUGE, METEOR, and BERTScore
- Statistical A/B testing framework with Cohen's d effect size calculation
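As a minimal sketch of the last item, Cohen's d compares two groups of scores by dividing the difference of their means by the pooled standard deviation; values near 0.2, 0.5, and 0.8 are conventionally read as small, medium, and large effects. The variant score lists below are hypothetical, and this is an illustration of the statistic itself, not this project's API.

```python
import math
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d using the pooled standard deviation of both samples."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(
        ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled_sd

# Hypothetical per-example quality scores from two prompt variants.
variant_a = [0.82, 0.78, 0.91, 0.69, 0.88]
variant_b = [0.71, 0.65, 0.80, 0.62, 0.74]
print(round(cohens_d(variant_a, variant_b), 2))  # a large effect (d > 0.8)
```

An effect size is reported alongside a significance test because with enough samples even a trivially small difference becomes statistically significant; d tells you whether the difference is big enough to matter.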