01. LLM-as-Judge patterns for qualitative pointwise and pairwise scoring
02. 2 GitHub stars
03. Human annotation frameworks with inter-rater agreement calculation
04. Automated NLP metrics including BLEU, ROUGE, and BERTScore
05. Statistical A/B testing for systematic model comparison
06. RAG-specific evaluation for retrieval accuracy and groundedness
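
The list mentions inter-rater agreement calculation without saying which statistic is used. As a rough illustration only, the sketch below computes Cohen's kappa for two annotators. The choice of kappa, the function name `cohens_kappa`, and the example labels are all assumptions made here, not details from the listing.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labeled at random according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Illustrative labels only, not real annotation data.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so values near 0 indicate agreement no better than chance and values near 1 indicate strong agreement. With more than two annotators, a different statistic such as Fleiss' kappa or Krippendorff's alpha would be needed.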