Implementations of automated metrics, including BLEU, ROUGE, and BERTScore
RAG-specific retrieval metrics such as MRR, NDCG, and Precision@K
Human evaluation frameworks with inter-rater agreement (Cohen's Kappa)
LLM-as-judge patterns for semantic quality and pairwise comparison
Statistical A/B testing and groundedness verification patterns
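
As a rough illustration of the retrieval and agreement metrics named above, here is a minimal pure-Python sketch of MRR, Precision@K, NDCG@K, and Cohen's Kappa. All function names and signatures are hypothetical and are not taken from the listed project. The NDCG variant assumes linear gain with a log2 rank discount, which is one common convention among several.

```python
import math
from collections import Counter


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gain and log2 discount (an assumed convention).

    gains maps doc_id -> graded relevance; missing ids count as 0.
    """
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

In practice, established libraries (for example scikit-learn for NDCG and Cohen's Kappa) are likely a better choice than hand-rolled versions; the sketch is only meant to make the definitions concrete.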