01. 2 GitHub stars
02. LLM-as-Judge patterns for pointwise scoring and pairwise model comparisons
03. Statistical A/B testing framework with Cohen’s d effect size calculations
04. Automated NLP metrics including BLEU, ROUGE, METEOR, and BERTScore
05. RAG-specific evaluation for retrieval relevance, NDCG, and groundedness
06. Regression detection systems to monitor performance stability over time
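
Two of the metrics named in the list, Cohen's d and NDCG, have short closed-form definitions, so a minimal sketch may help make them concrete. The function names `cohens_d` and `ndcg` and the sample scores are hypothetical, chosen for illustration only; this is not the project's actual API, and it uses the common pooled-standard-deviation form of Cohen's d and the linear-gain form of DCG.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Effect size between two score samples, using the pooled sample SD.

    Assumes each group has at least two observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero; d is undefined")
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for one ranked result list.

    `relevances` holds graded relevance labels in the order the retriever
    returned the documents. Returns 0.0 when no document is relevant.
    """
    if k is None:
        k = len(relevances)

    def dcg(scores):
        # Rank positions start at 1, so the discount is log2(position + 1).
        return sum(
            rel / math.log2(pos + 1)
            for pos, rel in enumerate(scores[:k], start=1)
        )

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical judge scores for two prompt variants in an A/B test.
    variant_a = [2, 4, 6]
    variant_b = [1, 3, 5]
    print("Cohen's d:", cohens_d(variant_a, variant_b))

    # Hypothetical graded relevance of retrieved passages, in ranked order.
    print("NDCG:", ndcg([0, 1]))
```

By the usual rule of thumb, a d near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large, which is why effect size is often reported alongside a p-value in A/B comparisons of model outputs.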