About
This skill integrates Evidently.ai into the Claude Code workflow to provide a framework for assessing and improving LLM performance. It enables developers to implement automated quality checks through text descriptors, set up LLM-as-a-judge evaluators for qualitative criteria, and run automated prompt tuning. Whether you're measuring RAG accuracy, comparing model variants, or monitoring production quality, this skill provides the tools to move from subjective assessment to data-driven LLM development within your Jupyter environment.
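As a quick illustration of the descriptor-based workflow, here is a minimal sketch of running deterministic text descriptors over a column of model responses with Evidently's `TextEvals` preset. It is not the skill's exact code: Evidently's import paths have changed across releases (this follows the 0.4.x layout), and the `response` column name and sample data are placeholders.

```python
import pandas as pd

# Evidently 0.4.x-style imports; newer releases reorganize these modules.
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

# Placeholder data: in practice, these would be your logged LLM outputs.
data = pd.DataFrame(
    {
        "response": [
            "Sure! Here is a concise summary of the document.",
            "I cannot help with that request.",
        ]
    }
)

# Run automated quality checks (sentiment, length) over each response.
report = Report(
    metrics=[
        TextEvals(
            column_name="response",
            descriptors=[Sentiment(), TextLength()],
        )
    ]
)
report.run(reference_data=None, current_data=data)
report.show()  # renders the report inline in a Jupyter notebook
```

For the LLM-as-a-judge evaluators mentioned above, Evidently also ships descriptors that call an external model (for example, built-in judges for refusals or toxicity); those require provider API credentials and their names vary by version, so check the Evidently docs for the release you have installed.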