01 Detailed F1 score reporting for retrieval accuracy vs. ground truth
02 Multi-category evaluation including multi-hop, temporal, and adversarial reasoning
03 Integrated comparison against industry baselines and human performance ceilings
04 Flexible execution modes for single conversation testing or full benchmark runs
05 Automated ingestion of multi-session LoCoMo datasets into memory observations
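
The F1 and per-category reporting listed above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: it assumes a token-overlap F1 between a predicted answer and the ground-truth answer, and the names `token_f1` and `f1_by_category`, as well as the `(category, prediction, ground_truth)` tuple layout, are hypothetical.

```python
from collections import Counter, defaultdict


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground-truth answer.

    Assumption: answers are compared as lowercased, whitespace-split tokens.
    The real benchmark may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_by_category(results):
    """Average F1 per question category.

    `results` is an iterable of (category, prediction, ground_truth)
    tuples; this layout is hypothetical.
    """
    scores = defaultdict(list)
    for category, prediction, ground_truth in results:
        scores[category].append(token_f1(prediction, ground_truth))
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


if __name__ == "__main__":
    sample = [
        ("temporal", "may 2023", "may 2023"),
        ("multi-hop", "paris france", "paris"),
        ("adversarial", "no answer", "unknown"),
    ]
    for category, score in f1_by_category(sample).items():
        print(f"{category}: {score:.3f}")
```

Grouping scores by category before averaging keeps weaker categories (for example adversarial questions) from being hidden behind a single aggregate number.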