Autonomously enhances Claude Code skills through iterative benchmarking, reflection-driven prompt mutation, and performance scoring.
Skill Auto-Optimizer implements an autonomous research loop to systematically increase the reliability of any Claude Code skill from baseline to production-grade performance. By adapting Andrej Karpathy's autoresearch methodology, the skill repeatedly executes target tasks, scores results against binary evaluation criteria, and uses reflection-driven mutation to diagnose and fix specific failure patterns. It provides a comprehensive optimization environment featuring a live HTML dashboard for real-time progress tracking, structured session archives to prevent redundant experiments, and sophisticated 'stuck detection' to overcome performance plateaus without manual intervention.
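The execute → score → reflect → mutate loop described above can be sketched as follows. This is a minimal illustrative sketch, not the skill's actual implementation: the function names (`run_task`, `score`, `mutate`), the patience-based stuck detection, and the toy demo at the bottom are all assumptions.

```python
def optimize(run_task, score, mutate, iters=10, patience=3):
    """Iteratively benchmark a prompt, keep the best variant, and mutate it.

    run_task(prompt) -> output; score(output) -> float in [0, 1];
    mutate(prompt, escalate) -> new prompt. All hypothetical signatures.
    """
    prompt = "v0"
    best = (score(run_task(prompt)), prompt)  # baseline measurement
    stalled = 0
    for _ in range(iters):
        if best[0] >= 1.0:          # all evaluation criteria pass
            break
        # 'Stuck detection': after repeated non-improvement, escalate mutation.
        candidate = mutate(best[1], escalate=stalled >= patience)
        s = score(run_task(candidate))
        if s > best[0]:
            best, stalled = (s, candidate), 0
        else:
            stalled += 1
    return best

# Toy demo (purely illustrative): each mutation appends a "+fix" marker,
# and the task fully passes once three fixes have accumulated.
run_task = lambda p: p.count("+fix")
score = lambda n: min(n / 3, 1.0)
mutate = lambda p, escalate: p + ("+fix+fix" if escalate else "+fix")
best_score, best_prompt = optimize(run_task, score, mutate)
# best_score reaches 1.0 after three improving iterations
```

The key design point is that the loop always mutates the best-known prompt rather than the most recent one, so a bad mutation never becomes the new starting point.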
Key Features
1. Reflection-driven mutation that diagnoses root causes of failed outputs
2. Live HTML dashboard for real-time visualization of improvement trends
3. Structured session archives to ensure cross-session experiment continuity
4. Automated binary evaluation system for objective performance scoring
5. Regression detection to prevent improvements in one area from breaking others
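Binary evaluation and regression detection (features 4 and 5 above) compose naturally: each criterion is a pass/fail predicate, and a candidate regresses if any criterion that passed before now fails. A minimal sketch, where the criterion names and checks are invented for illustration:

```python
def evaluate(output, criteria):
    """Binary evaluation: map each named criterion to a pass/fail result."""
    return {name: check(output) for name, check in criteria.items()}

def is_regression(prev, curr):
    """True if any criterion that previously passed now fails."""
    return any(prev[name] and not curr[name] for name in prev)

# Hypothetical criteria for a skill that must emit a markdown report.
criteria = {
    "has_header": lambda o: o.startswith("# "),
    "nonempty_body": lambda o: len(o.splitlines()) > 1,
}

prev = evaluate("# Title\nbody", criteria)  # both criteria pass
curr = evaluate("Title\nbody", criteria)    # header check now fails
# is_regression(prev, curr) -> the mutation broke a passing criterion
```

An optimizer would reject `curr` even if it improved some other score, since accepting it would trade one fixed failure for a new one.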
Use Cases
1. Automating the iterative prompt engineering cycle for complex workflows
2. Refining inconsistent skills that fail on edge cases or specific formatting
3. Establishing performance baselines for new AI agents and skills