01 0 GitHub stars
02 LLM-as-Judge patterns for pointwise and pairwise semantic evaluation
03 Statistical A/B testing framework with Cohen’s d effect size analysis
04 Retrieval evaluation for RAG systems using MRR, NDCG, and Precision@K
05 Automated regression detection to prevent performance degradation
06 Automated text generation metrics including BLEU, ROUGE, and BERTScore
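The retrieval metrics named in the list (MRR, NDCG, and Precision@K) can be sketched in a few lines of plain Python. This is a minimal illustration, not the framework's own implementation: the function names, the use of string document IDs, and the `gains` dictionary of graded relevance scores are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ordering.

    `gains` maps document ID to a graded relevance score (hypothetical
    input format); documents missing from it count as gain 0.
    """
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]   # hypothetical retriever output
    relevant = {"d1", "d2"}             # hypothetical ground truth
    print(precision_at_k(ranked, relevant, k=2))
    print(mean_reciprocal_rank([(ranked, relevant)]))
    print(ndcg_at_k(ranked, {"d1": 1.0, "d2": 1.0}, k=4))
```

Precision@K ignores ordering within the top K, MRR looks only at the first relevant hit, and NDCG rewards placing higher-gain documents earlier, so the three are usually reported together.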
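For the A/B testing item, Cohen's d expresses the difference between two variants' mean scores in units of their pooled standard deviation. The sketch below assumes two independent lists of per-example scores and uses the pooled-variance form of the statistic. The function names are hypothetical, and the "small / medium / large" cut-offs are Cohen's conventional 0.2 / 0.5 / 0.8 rules of thumb, not thresholds taken from the framework.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(control: Sequence[float], treatment: Sequence[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation.

    A positive value means the treatment mean is higher than the control mean.
    """
    n_c, n_t = len(control), len(treatment)
    if n_c < 2 or n_t < 2:
        raise ValueError("each sample needs at least two observations")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def effect_label(d: float) -> str:
    """Map |d| to Cohen's conventional rule-of-thumb labels."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Hypothetical per-example quality scores for two prompt variants.
    baseline = [0.61, 0.58, 0.64, 0.60, 0.59]
    candidate = [0.66, 0.63, 0.69, 0.65, 0.62]
    d = cohens_d(baseline, candidate)
    print(f"d = {d:.2f} ({effect_label(d)})")
```

An effect size complements a significance test rather than replacing it: a p-value indicates whether a difference is likely to be noise, while d indicates whether the difference is large enough to matter.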