Introduction
This skill provides a comprehensive framework for testing and validating Large Language Model (LLM) outputs within the Claude Code environment. It is built around the '95% Variance Finding', the research-backed insight that prompt quality and sampling settings account for nearly all variation in output quality, and uses it to focus developer effort where it matters most. By providing templates for multi-dimensional rubrics, automated LLM-as-judge prompts, and strategies for handling non-determinism, the skill enables developers to build rigorous quality gates, A/B test prompts, and detect regressions in AI-powered production systems.
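To illustrate how these pieces fit together, the sketch below combines a multi-dimensional rubric with an LLM-as-judge prompt to form a simple quality gate. It is a minimal sketch, not the skill's actual API: the `call_model` stub, the rubric dimensions, and the pass threshold are all hypothetical placeholders, and the stub returns a canned response so the example runs without credentials.

```python
import json

# Hypothetical rubric: each dimension is scored 1-5 by a judge model.
RUBRIC = {
    "accuracy": "Are all factual claims in the answer correct?",
    "completeness": "Does the answer address every part of the question?",
    "tone": "Is the answer professional and appropriately hedged?",
}

JUDGE_PROMPT = """You are a strict evaluator. Score the ANSWER below
against each rubric dimension on a 1-5 scale. Respond with JSON only,
e.g. {{"accuracy": 4, "completeness": 5, "tone": 3}}.

RUBRIC:
{rubric}

QUESTION:
{question}

ANSWER:
{answer}"""


def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. via an SDK or CLI).

    Swap in an actual client here; this stub returns a fixed response
    so the sketch runs without network access or API keys.
    """
    return '{"accuracy": 4, "completeness": 5, "tone": 4}'


def judge(question: str, answer: str, threshold: float = 3.5) -> bool:
    """Score one answer against the rubric; pass if the mean clears the threshold."""
    prompt = JUDGE_PROMPT.format(
        rubric=json.dumps(RUBRIC, indent=2), question=question, answer=answer
    )
    scores = json.loads(call_model(prompt))
    mean = sum(scores.values()) / len(scores)
    return mean >= threshold


if __name__ == "__main__":
    passed = judge("What is 2 + 2?", "2 + 2 equals 4.")
    print("quality gate passed" if passed else "quality gate failed")
```

Because a single judge call is itself non-deterministic, one common mitigation is to invoke `judge` several times per answer and aggregate the scores (for example, taking the median) before applying the threshold.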