Eval Harness FAQs

Question 1

How does the Eval Harness handle regression testing?

Accepted Answer

It lets you define regression evals that compare the current session's outputs against established baselines or specific SHAs (for example, a known-good commit), to confirm that new changes have not broken existing functionality.
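To make the comparison concrete, here is a minimal sketch of a regression check. It assumes each eval reduces to a named pass/fail result; the function name `find_regressions` and the sample eval names are hypothetical and are not part of the harness itself.

```python
from __future__ import annotations


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return names of evals that passed in the baseline but do not pass now."""
    regressions = []
    for name, passed_before in baseline.items():
        # Assumption: an eval missing from the current run counts as failing.
        passed_now = current.get(name, False)
        if passed_before and not passed_now:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical results: one eval broke, one was fixed, one is unchanged.
baseline = {"login-flow": True, "export-csv": True, "search": False}
current = {"login-flow": True, "export-csv": False, "search": True}

print(find_regressions(baseline, current))  # prints ['export-csv']
```

Only evals that previously passed can regress, so a newly fixed eval such as "search" is not reported.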

Question 2

What are pass@k metrics?

Accepted Answer

Pass@k measures the probability that at least one of k attempts succeeds. It helps developers gauge how reliable and consistent the AI's solutions are.
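As an illustration, the sketch below uses a widely used estimator for pass@k: given n recorded attempts of which c succeeded, it returns one minus the chance that k attempts drawn from those n contain no success. Whether the harness computes pass@k this way is an assumption; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts, c of which succeeded.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k attempts contains no successful attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 attempts, 3 of them successful.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")  # prints pass@1 = 0.300
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")  # prints pass@5 = 0.917
```

With k = 1 the estimate is simply the observed success rate, and it rises toward 1 as k grows.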

Question 3

Where are the evaluation results stored?

Accepted Answer

All evaluation definitions, run logs, and regression baselines are stored locally in the project's .claude/evals/ directory, so they can be committed and tracked in version control.

Question 4

Can I use existing test suites with this harness?

Accepted Answer

Yes. The Code-Based Grader is designed to integrate with any test runner that can be invoked from a bash command, such as Vitest, Jest, or pytest, so outputs are verified deterministically.
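The idea behind a code-based grader can be sketched as follows: run the test command and treat exit code 0 as a pass. This is a minimal illustration, not the harness's actual implementation; `run_code_grader` and its timeout parameter are hypothetical, and the sample commands are stand-ins so the sketch runs without any test runner installed.

```python
from __future__ import annotations

import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 300.0) -> bool:
    """Run a test command and report True only if it exits with code 0."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Assumption: a hung run or a missing executable counts as a failure.
        return False
    return completed.returncode == 0


# Stand-in commands; a real project might pass ["pytest", "-q"] or
# ["npx", "vitest", "run"] instead.
print(run_code_grader([sys.executable, "-c", "raise SystemExit(0)"]))  # prints True
print(run_code_grader([sys.executable, "-c", "raise SystemExit(1)"]))  # prints False
```

Relying on the exit code keeps the grade deterministic: the same test suite run against the same output always gives the same verdict.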

Question 5

What is Eval-Driven Development (EDD)?

Accepted Answer

EDD is a development methodology where success criteria and evaluation benchmarks are defined before the AI begins coding, ensuring performance is measured objectively rather than subjectively.

Eval Harness

Key Features

Use Cases
