Can I use my own custom evaluation script?

Yes. The setup skill lets you specify any shell command (for example, 'python eval.py' or 'npm test') as the evaluation command for your experiment.
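As a rough illustration, a custom 'eval.py' could time a workload and print a single number. The listing does not document how the skill reads the metric, so the "name: value" output line, the workload() function, and the helper names below are all assumptions made for this sketch.

```python
import time


def workload() -> int:
    # Hypothetical stand-in for the code being optimized.
    return sum(i * i for i in range(100_000))


def measure_seconds(fn, repeats: int = 5) -> float:
    """Return the best wall-clock time over several runs to reduce noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def format_metric(name: str, value: float) -> str:
    # Assumed convention: one "name: value" line printed to stdout.
    return f"{name}: {value:.6f}"


if __name__ == "__main__":
    print(format_metric("seconds", measure_seconds(workload)))
```

Taking the minimum of several runs is one common way to make a timing metric less sensitive to background load; a median would be an equally reasonable choice.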

What is an autoresearch experiment?

An autoresearch experiment is a structured workflow where an AI agent iteratively modifies a file to improve a specific metric, such as speed or accuracy, validated by an automated evaluation command.

What kind of metrics can I optimize for?

You can optimize for any numerical metric where either a lower value is better (like latency or file size) or a higher value is better (like test pass rates or engagement scores).
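The metric direction boils down to which comparison counts as progress. A minimal sketch, assuming a hypothetical Direction enum and is_improvement helper that are not part of the skill:

```python
from enum import Enum


class Direction(Enum):
    LOWER = "lower"    # e.g. latency, file size
    HIGHER = "higher"  # e.g. test pass rate, engagement score


def is_improvement(new: float, best: float, direction: Direction) -> bool:
    """Return True when `new` beats `best` in the chosen direction."""
    if direction is Direction.LOWER:
        return new < best
    return new > best


print(is_improvement(120.0, 150.0, Direction.LOWER))   # latency dropped
print(is_improvement(0.91, 0.95, Direction.HIGHER))    # pass rate dropped
```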

Where does the skill store the experiment configuration?

During setup, you can choose to store configurations locally in the project's '.autoresearch/' directory or globally in your user home directory.
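The two storage scopes could be resolved along these lines. The project path '.autoresearch/' comes from the answer above; the exact location under the home directory is not stated, so reusing the same directory name there is an assumption, as are the config_dir function and the scope names "project" and "user".

```python
from pathlib import Path


def config_dir(scope: str, project_root: Path) -> Path:
    """Resolve where experiment configuration would live for a given scope."""
    if scope == "project":
        return project_root / ".autoresearch"
    if scope == "user":
        # Assumption: the global scope uses the same directory name under home.
        return Path.home() / ".autoresearch"
    raise ValueError(f"unknown scope: {scope!r}")


print(config_dir("project", Path("/repo")))
```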

Autoresearch Experiment Setup

Name: Autoresearch Experiment Setup
Author: alirezarezvani


Analytics and Monitoring

Initializes automated research experiments by configuring targets, evaluation metrics, and optimization goals for Claude Code.

The setup skill for the Engineering Autoresearch Agent enables users to create structured optimization experiments within Claude Code. It guides users through defining an experiment domain, selecting a target file, establishing a baseline evaluation command (such as pytest or custom benchmarks), and choosing a metric direction. Whether used interactively or via CLI, it creates the necessary configuration and directory structure to start autonomous iteration and performance tuning for code, marketing content, or LLM system prompts.
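The pieces collected during setup can be pictured as a small record. The skill's real configuration format is not shown on this page, so the ExperimentConfig class, its field names, the example target path, and the JSON rendering below are purely illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentConfig:
    # All field names are hypothetical; they mirror the setup steps described above.
    domain: str        # e.g. "engineering", "content", "prompt engineering"
    target_file: str   # the file the agent is allowed to modify
    eval_command: str  # e.g. "pytest" or "python eval.py"
    metric: str        # name of the number being optimized
    direction: str     # "lower" or "higher"

    def __post_init__(self) -> None:
        if self.direction not in ("lower", "higher"):
            raise ValueError("direction must be 'lower' or 'higher'")


config = ExperimentConfig(
    domain="engineering",
    target_file="src/search.py",  # hypothetical path
    eval_command="python eval.py",
    metric="p50_latency_ms",
    direction="lower",
)
print(json.dumps(asdict(config), indent=2))
```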

Key Features

01 9,958 GitHub stars

02 Built-in evaluators for speed, size, pass rate, and LLM-based quality judging

03 Support for multiple domains including engineering, content, and prompt engineering

04 Interactive experiment configuration wizard for rapid setup

05 Automated baseline verification to ensure evaluation commands work correctly

06 Flexible storage scopes for both project-specific and user-wide experiments
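Baseline verification, as listed above, amounts to running the evaluation command once and confirming that it succeeds and yields a number. The sketch below is one way to do that with the standard library; verify_baseline is a hypothetical helper, and taking the last number printed to stdout as the metric is an assumption, not the skill's documented behavior.

```python
import re
import subprocess
import sys
from typing import List


def verify_baseline(command: List[str], timeout: float = 60.0) -> float:
    """Run the evaluation command once and return the metric it reports."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"evaluation command failed with exit code {result.returncode}"
        )
    # Assumption: the metric is the last number printed to stdout.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", result.stdout)
    if not numbers:
        raise RuntimeError("evaluation command printed no numeric metric")
    return float(numbers[-1])


# Usage with a stand-in command that prints a metric line.
baseline = verify_baseline([sys.executable, "-c", "print('seconds: 0.25')"])
print(baseline)
```

Passing the command as an argument list avoids shell quoting issues; a command given as a single string could be split with shlex.split first.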

Use Cases

01 Optimizing API performance and reducing p50 latency in backend services

02 Automating A/B testing for LLM system prompts to improve response quality scores

03 Benchmarking and reducing build sizes or memory usage in software development


Install with 🐟 Skill.Fish

npx skillfish add alirezarezvani/claude-skills setup

For use in Claude.ai and ChatGPT