01 Regression detection to prevent performance drops during model updates
02 3 GitHub stars
03 LLM-as-judge patterns for automated qualitative scoring and comparisons
04 Statistical A/B testing framework with significance and effect size analysis
05 RAG performance tracking for retrieval accuracy and groundedness
06 Automated NLP metrics including BLEU, ROUGE, and BERTScore
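
To make the regression-detection and A/B-testing items above more concrete, here is a minimal sketch of how a significance test and an effect size might be combined to flag a drop between a baseline and a candidate model. It is not taken from the project itself: the function names (`cohens_d`, `permutation_p_value`, `is_regression`), the thresholds, and the choice of a permutation test are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from statistics import mean, stdev


def cohens_d(baseline, candidate):
    """Effect size: difference in means over the pooled standard deviation."""
    na, nb = len(baseline), len(candidate)
    pooled_var = (
        (na - 1) * stdev(baseline) ** 2 + (nb - 1) * stdev(candidate) ** 2
    ) / (na + nb - 2)
    return (mean(candidate) - mean(baseline)) / pooled_var ** 0.5


def permutation_p_value(baseline, candidate, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    observed = abs(mean(candidate) - mean(baseline))
    pooled = list(baseline) + list(candidate)
    split = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[split:]) - mean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def is_regression(baseline, candidate, alpha=0.05, min_effect=0.5):
    """Hypothetical rule: flag only drops that are both significant and large."""
    d = cohens_d(baseline, candidate)
    p = permutation_p_value(baseline, candidate)
    return d <= -min_effect and p < alpha


# Illustrative per-example scores (made-up numbers, higher is better).
baseline_scores = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.82, 0.81]
candidate_scores = [0.70, 0.72, 0.69, 0.71, 0.73, 0.70, 0.72, 0.71]

if is_regression(baseline_scores, candidate_scores):
    print("Candidate model regressed against the baseline")
```

Requiring both a small p-value and a minimum effect size is one way to avoid flagging differences that are statistically detectable but too small to matter. The scores could come from any of the metrics listed above, provided higher values mean better quality.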