Implements an eval-driven development framework to systematically validate features and track regressions using pass@k metrics.
Eval Harness brings Eval-Driven Development (EDD) principles to Claude Code sessions. It lets developers treat evaluations as the unit tests of AI development: expected behaviors are defined before implementation and run continuously. The skill provides structured patterns for capability and regression testing, supports deterministic code-based graders alongside model-based and human graders, and tracks reliability metrics such as pass@k and pass^k. It is aimed at developers who need AI-generated code to meet specific functional requirements while keeping a project stable over time.
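To make the "evals as unit tests" idea concrete, here is a minimal sketch of a capability eval with a deterministic code-based grader. The `Eval` class, the `slugify` task, and the grader are hypothetical illustrations, not the skill's actual API: the point is that the expected behavior is encoded as executable checks before any implementation exists.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    """A single evaluation: a task description plus a deterministic grader."""
    name: str
    task: str
    grader: Callable[[str], bool]  # code-based grader: model output -> pass/fail

def grade_slugify(output: str) -> bool:
    """Execute the candidate code in an isolated namespace and check behavior."""
    ns: dict = {}
    try:
        exec(output, ns)
        return ns["slugify"]("Hello, World!") == "hello-world"
    except Exception:
        return False

slugify_eval = Eval(
    name="slugify-basic",
    task="Write a Python function slugify(title) that lowercases and hyphenates.",
    grader=grade_slugify,
)

# A correct candidate passes; a broken one fails deterministically.
good = ("def slugify(t):\n"
        "    import re\n"
        "    return re.sub(r'[^a-z0-9]+', '-', t.lower()).strip('-')")
bad = "def slugify(t):\n    return t"
print(slugify_eval.grader(good), slugify_eval.grader(bad))  # True False
```

Because the grader is plain code rather than a model judgment, its verdict is reproducible, which is what makes it usable as a regression gate.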
Key Features
- Integrated Baseline Tracking for Regression Testing
- Systematic Eval Definition and Reporting Workflow
- Capability and Regression Eval Framework
- Multi-mode Graders (Code, Model, and Human)
- Pass@k and Pass^k Reliability Metrics
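The two reliability metrics in the list above answer different questions: pass@k estimates the chance that at least one of k sampled attempts succeeds, while pass^k estimates the chance that all k succeed. A common way to compute pass@k without bias, given n trials of which c passed, is via binomial coefficients; the sketch below assumes this standard estimator rather than anything specific to Eval Harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n trials (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that k independent samples ALL pass,
    estimated from the empirical pass rate c/n."""
    return (c / n) ** k

# 10 trials, 7 passes: one of 3 samples almost certainly succeeds,
# but the odds that all 3 succeed are much lower.
print(round(pass_at_k(10, 7, 3), 3))   # 0.992
print(round(pass_pow_k(10, 7, 3), 3))  # 0.343
```

The gap between the two numbers is why pass^k is the stricter metric to track for features that must work every time, not just eventually.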
Use Cases
- Benchmarking Claude's performance on domain-specific tasks to ensure high reliability.
- Verifying complex feature implementations with objective success criteria before shipping.
- Preventing regressions in legacy codebases during AI-driven refactoring or migrations.
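The regression-prevention use case rests on baseline tracking: pass rates from a known-good run are stored, and later runs are compared against them. The helper below is a hypothetical sketch of that comparison, not the skill's actual implementation; the eval names and rates are invented for illustration.

```python
def check_regressions(baseline: dict[str, float],
                      results: dict[str, float],
                      tolerance: float = 0.0) -> list[str]:
    """Return names of evals whose pass rate dropped below the recorded
    baseline by more than `tolerance`; evals absent from the baseline
    (newly added) are skipped rather than flagged."""
    return [
        name for name, rate in results.items()
        if name in baseline and baseline[name] - rate > tolerance
    ]

baseline = {"slugify-basic": 1.0, "migrate-users": 0.8}
current = {"slugify-basic": 1.0, "migrate-users": 0.5, "new-eval": 0.9}
print(check_regressions(baseline, current))  # ['migrate-users']
```

Running a check like this after every change is what turns a one-off eval suite into a continuous regression gate.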