01 81 GitHub stars
02 Regression detection system to identify and prevent performance drops before deployment.
03 Comprehensive automated metrics for text generation, classification, and RAG retrieval.
04 Advanced LLM-as-Judge patterns for both pointwise scoring and pairwise comparisons.
05 Statistical A/B testing framework using Cohen's d and t-tests to validate model improvements.
06 Structured human evaluation frameworks with inter-rater agreement (Cohen's Kappa) calculations.
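
The statistics named in the list above (Cohen's d, a t-test, and Cohen's Kappa) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's own implementation: the function names cohens_d, welch_t, and cohens_kappa are hypothetical, the t-test shown is the Welch (unequal-variance) variant, and only the t statistic is computed. A p-value would additionally need a t-distribution CDF, for example via scipy.stats.ttest_ind with equal_var=False.

```python
import math
from collections import Counter
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size: mean difference (b - a) divided by the pooled sample std dev."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


def cohens_kappa(rater1, rater2):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater1)
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Illustrative, made-up scores for a baseline and a candidate model.
baseline = [1, 2, 3, 4, 5]
candidate = [2, 3, 4, 5, 6]
effect = cohens_d(baseline, candidate)
t_stat = welch_t(baseline, candidate)

# Illustrative, made-up labels from two human raters.
labels_a = ["good", "good", "bad", "bad"]
labels_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(labels_a, labels_b)
```

Cohen's d reports how large the improvement is in standard-deviation units, while the t statistic indicates whether it is distinguishable from noise, so the two are usually read together. Kappa subtracts the agreement two raters would reach by chance, which is why it is lower than the raw agreement rate.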