Evaluation-Driven Development Harness FAQs

Question 1

What is Evaluation-Driven Development (EDD)?

Accepted Answer

EDD is a workflow where you define success criteria and evaluation benchmarks for an AI agent before writing the implementation code, similar to how Test-Driven Development (TDD) works for traditional software.

Question 2

Where are my evaluation results and logs stored?

Accepted Answer

All evaluation definitions, execution logs, and baselines are stored in your project's .claude/evals/ directory, so you can version-control your evals alongside your codebase.

Question 3

Does this support manual verification?

Accepted Answer

Yes, the framework includes human grader flags for high-risk or subjective tasks that require a person to review the output before it is marked as a pass.

Question 4

How does pass@k improve my AI development workflow?

Accepted Answer

pass@k is the probability that at least one of 'k' attempts at a task succeeds. Tracking it alongside the single-attempt pass rate helps you identify flaky prompts: a task that passes within 'k' attempts but often fails on the first one is not yet achieving the desired result consistently.
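
To make "at least one success in 'k' attempts" concrete, here is a minimal sketch of one common way to estimate pass@k from recorded runs. It assumes you ran a task n times and c of those runs passed, and it uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k). The function name pass_at_k and its parameters are illustrative assumptions, not the harness's own API, and the harness may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    Hypothetical helper for illustration only. It returns the probability
    that a random subset of k attempts (out of the n recorded) contains
    at least one passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failures exist, so every subset of size k has a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 recorded attempts, 3 of them passed.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

A large gap between pass@1 and pass@5 for the same task is the kind of signal that suggests a flaky prompt: the agent can succeed, but not reliably on a single attempt.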

Question 5

Can I use this for non-deterministic or creative tasks?

Accepted Answer

Yes, the harness supports 'Model Graders' where Claude acts as a judge to evaluate open-ended or creative outputs against a specific rubric or set of quality guidelines.
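
As a rough illustration of rubric-based grading, the sketch below scores an output against each rubric criterion and passes only if every score clears a threshold. Everything here is an assumption made for the example: the names grade_with_rubric, RubricResult, and Judge, the 0-to-1 score scale, and the 0.7 threshold are hypothetical and do not describe the harness's actual interface. The judge is a plain callable so that a real model call can be plugged in; a trivial keyword-matching stand-in is used so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical contract: a judge takes (output, criterion) and returns a
# score between 0.0 and 1.0. In practice this would wrap a model call.
Judge = Callable[[str, str], float]


@dataclass
class RubricResult:
    scores: dict[str, float]  # score per rubric criterion
    passed: bool              # True only if every criterion meets the threshold


def grade_with_rubric(
    output: str,
    rubric: list[str],
    judge: Judge,
    threshold: float = 0.7,  # assumed cut-off, chosen for illustration
) -> RubricResult:
    """Score an output against each rubric criterion using the given judge."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    scores = {criterion: judge(output, criterion) for criterion in rubric}
    passed = all(score >= threshold for score in scores.values())
    return RubricResult(scores=scores, passed=passed)


def keyword_judge(output: str, criterion: str) -> float:
    """Stand-in judge for offline runs: 1.0 if the criterion word appears."""
    return 1.0 if criterion.lower() in output.lower() else 0.0


if __name__ == "__main__":
    result = grade_with_rubric(
        "A friendly and concise reply.",
        ["friendly", "concise"],
        keyword_judge,
    )
    print(result.scores, result.passed)
```

Keeping the judge as an injected callable separates the rubric logic from the model call, which makes the pass/fail rule easy to test without network access.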

Evaluation-Driven Development Harness

Key Features

Use Cases
