Can I use this skill for RAG pipelines?

Yes, the skill includes specialized patterns for evaluating Retrieval-Augmented Generation (RAG) quality, including factuality and context relevance checks.

What kind of reports can I generate?

You can generate detailed HTML reports for visual inspection or JSON reports for automated CI/CD pipelines and performance tracking.

What is Evidently.ai in this context?

Evidently.ai is an open-source framework used by this skill to calculate row-level metrics (descriptors) and generate aggregate reports for LLM performance monitoring.

Does it support local LLM models?

Yes, it is designed to work with Ollama and other OpenAI-compatible providers, allowing for local, private evaluation of your models.

How does prompt optimization work?

The skill uses a PromptOptimizer that iteratively refines your prompt templates by comparing results against a ground-truth dataset using an LLM-as-a-judge.

LLM Evaluation & Optimization

Name: LLM Evaluation & Optimization
Author: atrawog

byatrawog

0•

データサイエンスとML

Evaluates LLM outputs and optimizes prompts using Evidently.ai metrics and LLM-as-a-judge patterns.

This skill integrates Evidently.ai into the Claude Code workflow to provide a robust framework for assessing and improving LLM performance. It enables developers to implement automated quality checks through text descriptors, set up sophisticated LLM-as-a-judge evaluators for qualitative metrics, and perform automated prompt tuning. Whether you're measuring RAG accuracy, comparing model variations, or monitoring production quality, this skill provides the necessary tools to transition from subjective assessment to data-driven LLM development within your Jupyter environment.

主な機能

01Automated prompt optimization for classification and generation tasks

02LLM-as-a-Judge patterns for automated qualitative assessment

03Row-level text descriptors for sentiment, length, and validity

04Comprehensive HTML and JSON reporting for performance monitoring

05RAG quality evaluation focusing on relevance and factuality

060 GitHub stars

ユースケース

01Evaluating RAG pipeline output for hallucination and relevance

02Optimizing system prompts for higher classification accuracy

03Benchmarking performance differences between model versions

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add atrawog/bazzite-ai-plugins evaluation

For use in Claude.ai and ChatGPT

Download Skill