When should I use advanced evaluation instead of the base skill?

Use it in high-stakes contexts such as public launches or investor materials, or whenever you suspect the evaluation might be influenced by framing or order bias.

What is position bias in LLM evaluations?

Position bias is the tendency of an AI judge to favor responses based on their order in the prompt. The skill mitigates it by running each evaluation in more than one order and reconciling the results.
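A minimal sketch of the idea, not the skill's actual implementation: the `judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical, and the reconciliation rule (accept a winner only when both orders agree) is an assumption, since the listing does not say how results are reconciled.

```python
from typing import Callable

# Hypothetical judge: receives (first_shown, second_shown) and returns
# "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str], str]


def dual_pass_compare(judge: Judge, a: str, b: str) -> str:
    """Judge A vs. B in both orders and return "A", "B", or "tie"."""
    forward = judge(a, b)  # A is shown first
    reverse = judge(b, a)  # B is shown first

    # Translate positional verdicts back to the underlying responses.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    reverse_winner = {"first": "B", "second": "A"}.get(reverse, "tie")

    # Assumed reconciliation rule: a winner counts only if both orders agree.
    return forward_winner if forward_winner == reverse_winner else "tie"
```

A judge that always prefers whichever response it sees first yields `"tie"` under this rule, which is the signal that the verdict was driven by order rather than content.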

How does advanced-evaluation differ from standard AI evaluation?

Standard evaluation is often subjective and prone to simple heuristics. This skill uses multi-pass scoring, position bias mitigation, and adversarial probing to produce more objective results that hold up in high-stakes decisions.

What does 'Level 3 Evidence' mean in this context?

Level 3 Evidence requires the evaluator to provide exact quotes, explain the specific issue, and define what the ideal state looks like, ensuring feedback is actionable and grounded.
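One way to picture this requirement is as a record with three mandatory fields. The sketch below is illustrative only: the `Finding` class, its field names, and the `is_level3` check are hypothetical and are not taken from the skill itself.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    quote: str        # exact text copied from the evaluated output
    issue: str        # the specific problem with the quoted text
    ideal_state: str  # what a corrected version would look like


def is_level3(finding: Finding, source_text: str) -> bool:
    """True if all three fields are filled and the quote is verbatim."""
    fields = (finding.quote, finding.issue, finding.ideal_state)
    all_filled = all(field.strip() for field in fields)
    # The quote must appear word for word in the evaluated text.
    return all_filled and finding.quote in source_text
```

Checking that the quote appears verbatim in the source is a cheap guard against paraphrased or invented evidence.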

Can I use this skill to compare three or more outputs?

Yes, the skill includes a specific pairwise comparison protocol that uses a round-robin format to rank multiple competing outputs against each other systematically.
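As a rough illustration of a round-robin ranking, assuming a hypothetical `compare` callable that returns `"A"`, `"B"`, or `"tie"` for a pair of outputs, and an assumed scoring scheme of one point per win and half a point per tie (the listing does not specify how matches are scored):

```python
from itertools import combinations
from typing import Callable, Dict, List


def round_robin_rank(
    outputs: Dict[str, str],
    compare: Callable[[str, str], str],
) -> List[str]:
    """Rank output names by points earned across all pairwise matches."""
    points = {name: 0.0 for name in outputs}

    # Every output meets every other output exactly once.
    for x, y in combinations(outputs, 2):
        verdict = compare(outputs[x], outputs[y])
        if verdict == "A":
            points[x] += 1.0
        elif verdict == "B":
            points[y] += 1.0
        else:  # tie: split the point (assumed convention)
            points[x] += 0.5
            points[y] += 0.5

    # Highest score first; equal scores keep their original order.
    return sorted(points, key=lambda name: points[name], reverse=True)
```

With n outputs this runs n(n-1)/2 matches, so three drafts need three comparisons and five drafts need ten.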

Advanced Evaluation (LLM-as-Judge)

Author: nsalvacao

Data Science & ML

Implements high-rigor evaluation frameworks with bias mitigation and pairwise comparisons for mission-critical AI outputs.

This skill turns Claude into a systematic quality judge for high-stakes content. It goes beyond simple scoring by implementing dual-pass position bias mitigation, round-robin pairwise comparisons for multiple outputs, and adversarial probing, with the aim of objective, calibrated results. Whether you are validating investor materials, public-facing documentation, or complex technical explanations, the toolkit applies evidence-based LLM-as-judge protocols to check for production-grade quality.

Key Features

01 Adversarial Probing: Utilizes 'Steelman' and 'Devil’s Advocate' techniques to stress-test every evaluation.

02 Pairwise Comparison Protocol: Conducts systematic round-robin matches to accurately rank multiple competing outputs.

03 Calibrated Confidence Scoring: Maps scores to real-world anchors with explicit confidence intervals for every dimension.

04 Level 3 Evidence Standards: Mandates specific quotes and contextual justification for scores to eliminate generic feedback.

05 0 GitHub stars

06 Position Bias Mitigation: Executes dual-pass evaluations (forward and reverse) to eliminate ordering influence.

Use Cases

01 Comparing multiple AI-generated drafts for high-stakes executive summaries or investor pitches.

02 Benchmarking different prompt versions or model outputs to determine the most reliable implementation.

03 Validating technical documentation or policy papers against strict quality rubrics before public release.

Install with 🐟 Skill.Fish

npx skillfish add nsalvacao/nsalvacao-claude-code-plugins advanced-evaluation

For use in Claude.ai and ChatGPT
