Human evaluation frameworks with inter-rater agreement (Cohen's Kappa) calculations
LLM-as-judge patterns for pointwise, pairwise, and reference-based comparisons
Custom metric development for groundedness, toxicity, and factuality detection
Automated NLP metrics implementation including BLEU, ROUGE, and BERTScore
RAG performance measurement with MRR, NDCG, and Precision@K
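The inter-rater agreement item above can be sketched with a minimal, self-contained Cohen's Kappa implementation. This is an illustrative example, not the repository's actual code; the function name and label values are hypothetical.

```python
# Hypothetical sketch: Cohen's Kappa for two raters labeling the same items.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two equal-length label sequences, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need matched, non-empty ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators judging four model outputs as pass/fail.
print(cohens_kappa(["pass", "pass", "fail", "pass"],
                   ["pass", "fail", "fail", "pass"]))  # → 0.5
```

Values above ~0.6 are conventionally read as substantial agreement; near 0 means agreement is no better than chance.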
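The RAG retrieval metrics listed above (MRR and Precision@K) can be illustrated with a short sketch. The function names and document IDs here are assumptions for illustration, not the repository's API.

```python
# Hypothetical sketch of two standard retrieval metrics for RAG evaluation.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved document IDs that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Query 1: first relevant doc at rank 2; Query 2: at rank 3.
print(precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=3))  # → 2/3
print(mean_reciprocal_rank([["d2", "d1"], ["d5", "d6", "d3"]],
                           [{"d1"}, {"d3"}]))  # → (1/2 + 1/3) / 2
```

NDCG additionally discounts gains logarithmically by rank, rewarding relevant documents that appear earlier in the list.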