Eval Harness FAQs

Question 1

Where are the evaluation definitions stored?

Accepted Answer

Evaluation definitions, logs, and baselines are stored locally within your project's .claude/evals/ directory, allowing them to be versioned alongside your code.
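Because everything lives under one directory, you can inspect it with ordinary file tooling. The sketch below lists whatever is stored there; it assumes only the .claude/evals/ path named in the answer. The helper name list_eval_files is hypothetical and not part of Eval Harness, and no file naming scheme is assumed.

```python
from pathlib import Path

# Directory named in the answer above, relative to the project root.
EVALS_DIR = Path(".claude") / "evals"


def list_eval_files(project_root: Path) -> list[Path]:
    """Return every file under <project_root>/.claude/evals/, sorted by path.

    Hypothetical helper for illustration; returns an empty list when the
    directory does not exist yet.
    """
    evals_dir = project_root / EVALS_DIR
    if not evals_dir.is_dir():
        return []
    return sorted(p for p in evals_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = Path.cwd()
    for path in list_eval_files(root):
        print(path.relative_to(root))
```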

Question 2

What types of graders does Eval Harness support?

Accepted Answer

It supports Code-based graders (deterministic scripts), Model-based graders (using an LLM as a judge for open-ended output), and Human-review graders for subjective or high-risk outcomes.
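To make the three grader types concrete, here is a minimal sketch of what each might look like. All names (GradeResult, code_grader, model_grader, human_review_grader) and the PASS/FAIL judging convention are assumptions for illustration, not the Eval Harness API. The judge argument stands in for whatever LLM call you use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GradeResult:
    passed: Optional[bool]  # None means "waiting on a human decision"
    reason: str


def code_grader(output: str, expected: str) -> GradeResult:
    """Code-based: deterministic exact match after trimming whitespace."""
    ok = output.strip() == expected.strip()
    return GradeResult(ok, "exact match" if ok else "output differs from expected")


def model_grader(output: str, rubric: str, judge: Callable[[str], str]) -> GradeResult:
    """Model-based: ask an LLM judge (hypothetical callable) for PASS or FAIL."""
    prompt = f"Rubric:\n{rubric}\n\nOutput:\n{output}\n\nAnswer PASS or FAIL."
    verdict = judge(prompt).strip().upper()
    return GradeResult(verdict.startswith("PASS"), f"judge said {verdict!r}")


def human_review_grader(output: str) -> GradeResult:
    """Human-review: defer the verdict until a person has looked at the output."""
    return GradeResult(None, "queued for human review")
```

A code-based grader fits outputs with one correct answer; the model-based and human-review variants trade determinism for the ability to judge open-ended or high-risk results.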

Question 3

What is Evaluation-Driven Development (EDD)?

Accepted Answer

EDD is a methodology in which you define an AI agent's expected behavior and evaluation criteria before you start implementing it, much as Test-Driven Development (TDD) has you write tests before the code they exercise.
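The workflow can be sketched in a few lines: write the cases first, watch them fail against a stub, then implement until they pass. EvalCase, run_evals, and the toy greeting agent are all hypothetical and serve only to show the order of steps; they are not Eval Harness constructs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected: str


# Step 1: define expected behavior before any agent code exists.
CASES = [
    EvalCase("greets-ada", "Greet Ada", "Hello, Ada!"),
    EvalCase("greets-bob", "Greet Bob", "Hello, Bob!"),
]


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run each case; an unimplemented agent counts as a failure."""
    results = {}
    for case in cases:
        try:
            results[case.name] = agent(case.prompt) == case.expected
        except NotImplementedError:
            results[case.name] = False
    return results


# Step 2: a stub makes every eval fail, confirming the evals are live.
def agent_stub(prompt: str) -> str:
    raise NotImplementedError


# Step 3: implement until the evals pass.
def agent_v1(prompt: str) -> str:
    name = prompt.removeprefix("Greet ").strip()
    return f"Hello, {name}!"


if __name__ == "__main__":
    print(run_evals(agent_stub, CASES))
    print(run_evals(agent_v1, CASES))
```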

Question 4

How does pass@k differ from pass^k?

Accepted Answer

pass@k measures whether an agent succeeds at least once in k attempts, which represents practical reliability. pass^k measures whether the agent succeeds in every one of the k attempts, which represents stability.
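Both metrics are easy to compute from a list of per-attempt outcomes. The sketch below also shows the expected values under the simplifying assumption that attempts are independent and each succeeds with probability p; that assumption and the function names are mine, not something Eval Harness specifies.

```python
def pass_at_k(outcomes: list[bool]) -> bool:
    """True if at least one of the k attempts succeeded."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return any(outcomes)


def pass_pow_k(outcomes: list[bool]) -> bool:
    """True only if all k attempts succeeded."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return all(outcomes)


def expected_pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1.0 - (1.0 - p) ** k


def expected_pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


if __name__ == "__main__":
    attempts = [True, False, True]
    print(pass_at_k(attempts))   # one success is enough
    print(pass_pow_k(attempts))  # a single failure sinks it
    # With p = 0.8 and k = 3: roughly 0.992 versus 0.512.
    print(expected_pass_at_k(0.8, 3), expected_pass_pow_k(0.8, 3))
```

The gap widens as k grows: pass@k approaches 1 while pass^k approaches 0 for any p strictly between 0 and 1, which is why the two numbers answer different questions.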

Question 5

Can I use this for regression testing?

Accepted Answer

Yes. Eval Harness is designed to run regression evaluations, which confirm that new code changes or prompt adjustments do not break functionality that previously worked.
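Conceptually, a regression check compares current results against a stored baseline and flags anything that used to pass and no longer does. The following is a minimal sketch of that comparison; the find_regressions helper and the name-to-pass/fail mapping are assumed for illustration and do not reflect the actual baseline format.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Names of evals that passed in the baseline but fail or are missing now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    baseline = {"login": True, "search": True, "export": False}
    current = {"login": True, "search": False, "export": False}
    regressions = find_regressions(baseline, current)
    print(regressions)
    # A non-empty list would typically fail the CI job.
    raise SystemExit(1 if regressions else 0)
```

Evals that were already failing in the baseline are deliberately ignored here, so the check reports only newly broken behavior.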

Eval Harness

Key Features

Use Cases
