01 Task-specific metric evaluation (BLEU, ROUGE, F1)
02 Automated model evaluation loop and metric computation
03 Checkpoint identification by path, step, or version
04 Support for validation, test, and custom datasets
05 Multi-checkpoint comparison and reporting
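
The features above can be illustrated with a minimal sketch. It is not the project's actual implementation: the names `token_f1`, `evaluate_checkpoint`, and `compare_checkpoints` are hypothetical, checkpoints are stood in for by plain Python callables keyed by a step label, and a simple whitespace-token F1 takes the place of the real BLEU, ROUGE, or F1 metrics.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, reference output)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate_checkpoint(predict: Callable[[str], str],
                        dataset: List[Example]) -> float:
    """Mean token F1 of one checkpoint over a dataset."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [token_f1(predict(text), ref) for text, ref in dataset]
    return sum(scores) / len(scores)


def compare_checkpoints(checkpoints: Dict[str, Callable[[str], str]],
                        dataset: List[Example]) -> List[Tuple[str, float]]:
    """Score every checkpoint and return (name, score) pairs, best first."""
    results = [(name, evaluate_checkpoint(predict, dataset))
               for name, predict in checkpoints.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy validation set and two stand-in "checkpoints" (assumptions).
    validation_set = [("a", "the cat sat"), ("b", "hello world")]
    checkpoints = {
        "step-100": lambda text: "the cat" if text == "a" else "hello",
        "step-200": lambda text: "the cat sat" if text == "a" else "hello world",
    }
    for name, score in compare_checkpoints(checkpoints, validation_set):
        print(f"{name}\t{score:.3f}")
```

In a real setup, each callable would wrap a model loaded from a checkpoint path, step, or version, and the dataset argument would be the validation, test, or custom split being scored.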