Human evaluation frameworks with inter-rater agreement calculations
Retrieval (RAG) performance tracking using MRR, NDCG, and Precision@K
Automated text metrics including BLEU, ROUGE, and BERTScore
LLM-as-Judge patterns for qualitative scoring and pairwise comparison
Custom metric implementation for groundedness, factuality, and toxicity
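
As a rough illustration of two items in the list, the sketch below computes Precision@K, MRR, and NDCG for ranked retrieval results, plus Cohen's kappa as one possible inter-rater agreement measure. All function names and input shapes here are hypothetical and chosen for the example; they are not taken from this project's API. The NDCG variant uses linear gain with a log2 discount, which is only one of several common definitions.

```python
import math
from collections import Counter


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gain; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(i + 2)
        for i, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 2)` scores one relevant hit in the top two results, and `cohens_kappa` expects two equal-length label lists, one per rater, aligned by item.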