Regression detection tools to identify performance drops across different model versions.
Specialized RAG metrics such as MRR, NDCG, and groundedness/faithfulness checks.
LLM-as-Judge patterns for automated semantic assessment and pairwise comparisons.
Statistical A/B testing framework to measure improvement significance and effect size.
Automated NLP metrics including BLEU, ROUGE, and BERTScore for text generation quality.
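
As an illustration of the RAG retrieval metrics named above, here is a minimal sketch of MRR and NDCG in plain Python. It is not this tool's implementation: the function names (`mean_reciprocal_rank`, `dcg`, `ndcg_at_k`) and the input layout are assumptions made for the example, and the ideal ranking for NDCG is taken from the retrieved list itself, which assumes every relevant document appears somewhere in that list.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries. Each inner sequence holds 1/0 relevance flags in rank order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant result contributes
    return total / len(ranked_relevance)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """NDCG@k, normalised by the best ordering of the same gains (an assumption)."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical data: three queries, relevance flags listed in retrieved order.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
    # Hypothetical graded relevance for one query's top three results.
    print(ndcg_at_k([1, 3, 0], k=3))
```

A query with no relevant result contributes zero to MRR, and a query whose gains are all zero yields an NDCG of zero rather than a division error.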