Creation of specific, binary scoring rubrics to eliminate measurement noise
Automated generation of Langfuse-compatible implementation specifications
Evidence-based selection between code-based, LLM-as-judge, and human-in-the-loop evals
Calibration strategies for aligning LLM judges with human expert intuition
Structured 4-phase design process: Understand, Identify, Match, and Design
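To make the first point concrete, here is a minimal sketch of what a specific, binary rubric might look like when implemented as a code-based eval. The rubric structure, criterion names, and checks are all hypothetical illustrations, not output of this tool: each criterion is a deterministic pass/fail predicate, so scores are reproducible and free of the noise that graded (e.g. 1-5) scales introduce.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BinaryCriterion:
    """One pass/fail check; binary outcomes avoid rater disagreement on partial credit."""
    name: str
    check: Callable[[str], bool]  # code-based eval: a deterministic predicate

def score(output: str, rubric: list[BinaryCriterion]) -> dict[str, int]:
    # Each criterion yields exactly 0 or 1; no ambiguous middle grades.
    return {c.name: int(c.check(output)) for c in rubric}

# Hypothetical rubric for a customer-support answer
rubric = [
    BinaryCriterion("mentions_refund_policy", lambda o: "refund" in o.lower()),
    BinaryCriterion("under_100_words", lambda o: len(o.split()) < 100),
]

scores = score("Our refund policy allows returns within 30 days.", rubric)
```

Binary criteria like these are also straightforward to log as numeric scores in an observability platform such as Langfuse, since each one maps to a named 0/1 value.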