01 Automated NLP metrics, including BLEU, ROUGE, and BERTScore
02 RAG-specific metrics for retrieval and groundedness assessment
03 LLM-as-Judge patterns for scalable, automated quality grading
04 Statistical A/B testing and regression detection frameworks
05 Human evaluation workflows with inter-rater agreement tracking
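
As an illustration of the inter-rater agreement tracking mentioned in item 05, here is a minimal sketch of Cohen's kappa for two raters. The source names no specific agreement statistic, so the choice of Cohen's kappa is an assumption, and the function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Illustrative helper (an assumption, not part of any named library):
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal label frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters always use the same single label, kappa is undefined;
    # this sketch treats that case as perfect agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels from two human graders on four model outputs.
grader_1 = ["good", "good", "bad", "bad"]
grader_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(grader_1, grader_2)
```

Kappa near 1 suggests the graders agree well beyond chance, while values near 0 suggest agreement no better than chance. A workflow could recompute it per annotation batch to notice when the grading guidelines need tightening.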