Statistical A/B testing and performance regression detection
Automated NLP metrics including BLEU, ROUGE, and BERTScore
Human annotation frameworks with inter-rater agreement tracking
LLM-as-judge patterns for qualitative and semantic assessment
Retrieval-augmented generation (RAG) specific evaluation (MRR, NDCG)
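
To make the retrieval metrics named in the last item concrete, here is a minimal sketch of MRR and NDCG in plain Python. The function names and input shapes are illustrative assumptions, not an API taken from this project. It also assumes linear gains and normalizes NDCG against the ideal ordering of the retrieved list only, whereas some definitions normalize against all relevant documents in the corpus.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Average of 1/rank of the first relevant hit per query.

    Each inner sequence holds 0/1 relevance flags in ranked order
    (hypothetical input format chosen for this sketch).
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same gains sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])` averages 1/2 and 1/1, and `ndcg_at_k([3, 2, 1], 3)` is 1.0 because that list is already in ideal order.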