01 LLM-as-Judge patterns for automated pointwise and pairwise output comparisons
02 Specialized RAG evaluation metrics for retrieval performance and groundedness
03 Automated text generation and classification metrics including BLEU, ROUGE, and BERTScore
04 Statistical A/B testing framework with Cohen’s d effect size and p-value analysis
05 424 GitHub stars
06 Automated regression detection to track performance across model and prompt versions
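
As a rough illustration of the A/B testing item, the sketch below compares per-example quality scores from two prompt versions. It is not taken from the tool itself: the function names `cohens_d` and `permutation_p_value` and the sample scores are hypothetical, and the p-value comes from a simple two-sided permutation test, chosen here only because it needs nothing beyond the Python standard library.

```python
import random
import statistics


def _mean(xs):
    return sum(xs) / len(xs)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (_mean(b) - _mean(a)) / pooled_sd


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means (an assumed choice of test)."""
    rng = random.Random(seed)
    observed = abs(_mean(b) - _mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[len(a):]) - _mean(pooled[:len(a)]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical judge scores for the same test set under two prompt versions.
    baseline = [0.61, 0.58, 0.70, 0.66, 0.59, 0.63]
    candidate = [0.72, 0.69, 0.75, 0.71, 0.68, 0.74]
    print("Cohen's d:", round(cohens_d(baseline, candidate), 2))
    print("p-value:", permutation_p_value(baseline, candidate))
```

Reporting the effect size alongside the p-value matters because a tiny improvement can be statistically significant on a large test set while being too small to justify shipping a new prompt or model version.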