About
This skill provides specialized knowledge to build and execute sophisticated evaluation systems for Large Language Models. It synthesizes industry best practices and academic research to implement direct scoring and pairwise comparison methodologies while actively mitigating systematic biases like position and length bias. Whether you are building automated evaluation pipelines, designing complex rubrics, or conducting A/B tests for prompt engineering, this skill ensures your AI assessments are consistent, reliable, and closely aligned with human judgment.
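One of the bias mitigations mentioned above, countering position bias in pairwise comparison, can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `judge` here is a hypothetical stand-in for an LLM judge call that returns `"A"` or `"B"` for whichever response it prefers, and the idea is simply to run the judge in both presentation orders and accept a winner only when the verdicts agree.

```python
def debiased_compare(judge, response_1, response_2):
    """Run the judge in both orders; declare a winner only on agreement."""
    first = judge(response_1, response_2)   # response_1 shown in position "A"
    second = judge(response_2, response_1)  # response_2 shown in position "A"
    # The same underlying response must win in both orders to count.
    if first == "A" and second == "B":
        return "response_1"
    if first == "B" and second == "A":
        return "response_2"
    return "tie"  # order-dependent verdicts are treated as a tie

# Toy judge with pure position bias: always prefers whichever is shown first.
biased_judge = lambda a, b: "A"
print(debiased_compare(biased_judge, "x", "y"))  # → "tie"

# Toy judge with a genuine (order-independent) preference for longer answers.
length_judge = lambda a, b: "A" if len(a) > len(b) else "B"
print(debiased_compare(length_judge, "longer answer", "short"))  # → "response_1"
```

The same swap-and-agree pattern extends to length bias (e.g., normalizing or penalizing verbosity before judging), which this skill also covers.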