Overview
The Eval Framework skill provides a structured meta-framework for managing AI-driven evaluations such as architecture reviews, code audits, and security checks. It addresses the challenge of AI output variance by enforcing a strict YAML schema for findings, storing results in version-controlled files, and providing analytical tools that calculate overlap, precision, and recall between runs. Developers can use it to audit Claude's outputs, cross-validate findings across models (e.g., Opus vs. Sonnet), and track code quality over time through data-backed consistency scores and automated comparison reports.
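To make the comparison tooling concrete, here is a minimal Python sketch of how overlap, precision, and recall could be computed between two findings files. The file layout (a top-level `findings` list), the field names (`file`, `rule_id`, `severity`), and the paths under `runs/` are illustrative assumptions, not the skill's actual schema or CLI.

```python
import yaml  # pip install pyyaml


def load_findings(path: str) -> set[tuple[str, str]]:
    """Load a findings YAML file and key each finding by (file, rule_id).

    Assumed layout (hypothetical, not the skill's real schema):
        findings:
          - file: src/auth.py
            rule_id: SEC-012
            severity: high
    """
    with open(path) as f:
        data = yaml.safe_load(f)
    return {(item["file"], item["rule_id"]) for item in data["findings"]}


def compare_runs(baseline_path: str, candidate_path: str) -> dict:
    """Score a candidate run against a baseline run.

    Treating the baseline as ground truth:
        precision = matched / candidate findings
        recall    = matched / baseline findings
    """
    baseline = load_findings(baseline_path)
    candidate = load_findings(candidate_path)
    overlap = baseline & candidate
    return {
        "overlap": len(overlap),
        "precision": len(overlap) / len(candidate) if candidate else 0.0,
        "recall": len(overlap) / len(baseline) if baseline else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical paths; in practice these would be the
    # version-controlled result files from two evaluation runs.
    metrics = compare_runs("runs/opus.yaml", "runs/sonnet.yaml")
    print(f"overlap={metrics['overlap']}  "
          f"precision={metrics['precision']:.2f}  "
          f"recall={metrics['recall']:.2f}")
```

Keying findings on a stable identifier like `(file, rule_id)` is one way to tolerate wording variance between runs; fuzzier matching (e.g., on overlapping line ranges) would trade strictness for coverage.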