Where is the benchmark data stored locally?

The skill automatically clones the LoCoMo repository and caches the data in the /tmp/locomo/ directory for evaluation.

How do I run a full memory benchmark?

Run the full benchmark with the command 'python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full', which evaluates all 10 standard conversations in the dataset.
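
If the run needs to be launched from another Python script, the following is a minimal sketch of how the command could be assembled. It assumes PLUGIN_DIR is available as an environment variable; the helper names build_full_benchmark_command and run_full_benchmark are hypothetical and not part of the skill.

```python
import os
import subprocess


def build_full_benchmark_command(plugin_dir: str) -> list[str]:
    """Assemble the documented command as an argument list.

    Hypothetical helper: only the script path and the --full flag
    come from the skill's documentation.
    """
    script = os.path.join(plugin_dir, "scripts", "locomo-benchmark.py")
    return ["python3", script, "--full"]


def run_full_benchmark() -> None:
    """Run the full benchmark, assuming PLUGIN_DIR is set in the environment."""
    plugin_dir = os.environ.get("PLUGIN_DIR")
    if not plugin_dir:
        raise RuntimeError("PLUGIN_DIR is not set")
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_full_benchmark_command(plugin_dir), check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if the plugin directory contains spaces.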

Does this skill work with any model?

This skill is designed to evaluate the memory capabilities of the model currently configured in your environment, inheriting the model settings from your Claude Code configuration.

What categories of questions are included in the evaluation?

The evaluation includes five categories: Multi-hop (connecting multiple facts), Single-hop (direct retrieval), Temporal (dates/times), Open-domain, and Adversarial (handling missing information).
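
To compare performance across these categories, per-question scores can be averaged within each one. The sketch below assumes each result is a (category, F1) pair; the category labels and the function name summarize_by_category are illustrative and may not match the identifiers the benchmark script uses.

```python
from collections import defaultdict
from statistics import mean

# Illustrative labels for the five categories; the script's own names may differ.
CATEGORIES = ["multi-hop", "single-hop", "temporal", "open-domain", "adversarial"]


def summarize_by_category(results):
    """Return the mean F1 per category for an iterable of (category, f1) pairs."""
    scores = defaultdict(list)
    for category, f1 in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        scores[category].append(f1)
    # Categories with no questions are left out of the summary.
    return {c: mean(scores[c]) for c in CATEGORIES if scores[c]}
```

For example, summarize_by_category([("temporal", 1.0), ("temporal", 0.5)]) returns {"temporal": 0.75}.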

What does the LoCoMo benchmark measure?

LoCoMo measures the ability of an AI system to remember and retrieve information across long-term, multi-session conversations, focusing on facts that require multi-hop reasoning or temporal awareness.

LoCoMo Memory Benchmark

Name: LoCoMo Memory Benchmark
Author: genomewalker

Data Science & Machine Learning

Evaluates long-term conversational memory performance using the ACL 2024 LoCoMo benchmark suite.

The LoCoMo Benchmark skill is a specialized evaluation tool designed to measure the effectiveness of long-term conversational memory within the cc-soul ecosystem. It automates the process of ingesting multi-session conversation data—extracting observations and speaker facts—and then subjects the system to rigorous QA testing. By measuring retrieval accuracy across multi-hop, temporal, and adversarial categories, it provides developers with a standardized F1 score to compare their AI's memory performance against human baselines and state-of-the-art models like GPT-4.
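
For readers unfamiliar with F1 scoring of free-text answers, here is a minimal sketch of a token-level F1 in the style commonly used for QA benchmarks. It is an assumption that the skill scores answers this way; the normalization rules and the names normalize and token_f1 are illustrative, not taken from the skill's code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Two empty answers count as a match; one empty answer scores 0.
        return float(pred == gold)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch an exact match such as token_f1("7 May 2023", "7 May 2023") scores 1.0, while an answer with one extra token, such as "in Paris" against "Paris", scores about 0.67.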

Key Features

01 Detailed F1 score reporting for retrieval accuracy vs. ground truth

02 Multi-category evaluation including multi-hop, temporal, and adversarial reasoning

03 Integrated comparison against industry baselines and human performance ceilings

04 Flexible execution modes for single conversation testing or full benchmark runs

05 Automated ingestion of multi-session LoCoMo datasets into memory observations

06 0 GitHub stars

Use Cases

01 Quantifying the accuracy of long-term memory retrieval in AI agents

02 Benchmarking custom RAG implementations against established AI research standards

03 Debugging temporal reasoning and fact-linkage capabilities in conversational models

Install with 🐟 Skill.Fish

npx skillfish add genomewalker/cc-soul locomo-benchmark
