Introduction
The eval-design skill provides a systematic framework for moving beyond "vibes-based" testing to empirical, production-grade measurement of AI quality. It guides users through a structured process: identifying real-world failure modes, selecting the appropriate evaluation type (from cost-effective code-based checks to more sophisticated LLM-as-judge patterns), and generating precise technical specifications. By focusing on binary pass/fail criteria and calibration, it helps teams build robust golden datasets and automated metrics that are ready to implement in observability platforms such as Langfuse.
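To make the distinction concrete, a code-based check in the sense used here can be as small as a deterministic schema validation that yields a binary pass/fail verdict for each golden-dataset case. The sketch below is illustrative only: the names (`GOLDEN_SET`, `REQUIRED_KEYS`, `passes_schema_check`) and the JSON-shaped output contract are assumptions for this example, not something the skill prescribes.

```python
# Minimal sketch of a binary, code-based check (hypothetical names and data).
import json

GOLDEN_SET = [
    # Each case pairs an input with a model output captured from production.
    {"input": "Summarize order #123", "output": '{"summary": "Shipped", "order_id": 123}'},
    {"input": "Summarize order #456", "output": "Sorry, I can't help with that."},
]

# The output contract assumed for this example: valid JSON with these keys.
REQUIRED_KEYS = {"summary", "order_id"}


def passes_schema_check(output: str) -> bool:
    """Binary pass/fail: output must be valid JSON containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()


if __name__ == "__main__":
    for case in GOLDEN_SET:
        verdict = "PASS" if passes_schema_check(case["output"]) else "FAIL"
        print(f"{verdict}: {case['input']}")
```

Because the verdict is strictly binary, results aggregate into a simple pass rate that can be tracked over time; subtler qualities (tone, faithfulness) that resist this kind of deterministic check are where the LLM-as-judge patterns and calibration steps of the skill come in.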