01 3 GitHub stars
02 Human evaluation frameworks with inter-rater agreement (Cohen's Kappa)
03 LLM-as-Judge patterns for pointwise and pairwise assessment
04 RAG-specific metrics for groundedness, retrieval, and factuality
05 Statistical A/B testing with t-tests and effect size calculations
06 Automated text generation metrics (BLEU, ROUGE, BERTScore)
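
As an illustration of the inter-rater agreement item, here is a minimal sketch of Cohen's Kappa for two raters who label the same items. It is not taken from the project itself: the function name `cohens_kappa` and the example labels are hypothetical, and returning 1.0 when expected agreement is exactly 1 is an assumed convention, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement: for each label, the product of the two raters'
    # marginal probabilities of using it, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined here.
        # Returning 1.0 is an assumption made for this sketch.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four model outputs by two human annotators.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this example the raters agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so kappa corrects the raw agreement downward to account for matches that would occur by chance.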