- Composite evaluator patterns for combining hard constraints with LLM scoring
- Support for deterministic bash-based scoring scaffolds
- Automated generation of Python-based LLM judge evaluators
- Dataset-aware template generation for ground-truth comparisons
- Customizable diagnostic feedback for iterative artifact reflection
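The composite-evaluator pattern above can be sketched as follows. This is a minimal illustration, not the project's actual API: the names (`EvalResult`, `composite_evaluate`) and the stand-in constraint and judge functions are hypothetical. The idea is to gate on cheap deterministic checks first and only invoke the (expensive, nondeterministic) LLM judge when every hard constraint passes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalResult:
    passed: bool   # did every hard constraint hold?
    score: float   # LLM-judge score in [0, 1]; 0.0 when gated out


def composite_evaluate(
    artifact: str,
    constraints: List[Callable[[str], bool]],
    llm_judge: Callable[[str], float],
) -> EvalResult:
    """Run deterministic constraints first; call the LLM judge only
    when all of them pass."""
    if not all(check(artifact) for check in constraints):
        return EvalResult(passed=False, score=0.0)
    return EvalResult(passed=True, score=llm_judge(artifact))


# Hypothetical stand-ins for illustration only:
def non_empty(text: str) -> bool:
    return bool(text.strip())


def under_100_words(text: str) -> bool:
    return len(text.split()) <= 100


def stub_judge(text: str) -> float:
    # Placeholder for a real LLM call returning a quality score.
    return 0.8


result = composite_evaluate(
    "A short answer.", [non_empty, under_100_words], stub_judge
)
print(result.passed, result.score)  # True 0.8
```

Gating before judging keeps scores deterministic for clearly invalid artifacts and avoids spending LLM calls on them.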