01. LLM-as-judge scoring for generation faithfulness, relevance, and coherence
02. Detailed performance reports with latency analysis and failure diagnostics
03. 17 GitHub stars
04. Production-grade benchmarking against the Ailog RAG API
05. Automated retrieval metrics calculation including Recall, Precision, MRR, and NDCG
06. Synthetic test dataset generation from existing indexed documents
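
The retrieval metrics named in the list (Recall, Precision, MRR, NDCG) can be illustrated with a minimal Python sketch. This is not the project's own implementation. The function names, the binary relevance assumption, and the per-query inputs (a ranked list of document IDs plus a set of relevant IDs) are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Sequence[str]], relevants: Sequence[Set[str]]
) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants))
    return total / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains (assumption: a document is relevant or not)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single query: four retrieved IDs, two of them relevant.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
print(precision_at_k(ranked_ids, relevant_ids, 3))
print(recall_at_k(ranked_ids, relevant_ids, 3))
print(reciprocal_rank(ranked_ids, relevant_ids))
print(ndcg_at_k(ranked_ids, relevant_ids, 3))
```

Binary relevance keeps the sketch short. A graded variant would replace the constant gain of 1 in `ndcg_at_k` with a per-document relevance score.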