What is Evidently.ai?

Evidently.ai is an open-source framework designed to evaluate, test, and monitor machine learning models, with specific tools for analyzing LLM quality and text data.

Can I use local models for LLM evaluation?

Yes, the skill supports local providers like Ollama through the OpenAI-compatible provider interface, allowing for cost-effective local evaluation.

How does prompt optimization work in this skill?

The skill uses a PromptOptimizer to iteratively test prompt variations against a target dataset, using an LLM-as-a-Judge to score outputs and find the most effective template.

Is it possible to evaluate RAG systems?

Yes, the skill includes patterns specifically for RAG, such as judging if an answer is relevant to a question and factually accurate based on the provided context.

What metrics can I track with text descriptors?

You can track character counts, word counts, sentiment scores, JSON/Python syntax validity, and custom regex patterns to ensure output consistency.

LLM Evaluation with Evidently.ai

Name: LLM Evaluation with Evidently.ai
Author: atrawog

byatrawog

0•

データサイエンスとML

Evaluates LLM output quality and optimizes prompt templates using Evidently.ai metrics and LLM-as-a-Judge patterns.

The evaluation skill provides a robust framework for assessing and improving LLM performance within Jupyter environments. By integrating Evidently.ai, it allows developers to implement sophisticated evaluation pipelines featuring text descriptors, aggregate reports, and automated prompt optimization. Whether you are building RAG systems, comparing model outputs, or fine-tuning prompts for classification tasks, this skill provides the necessary tools for data-driven AI development and quality assurance.

主な機能

01Built-in reporting for metric drift and aggregate performance tracking

02Specialized evaluation templates for RAG relevance and thinking models

030 GitHub stars

04Automated prompt optimization for classification and generation tasks

05Comprehensive text descriptors including sentiment, JSON validity, and regex

06LLM-as-a-Judge patterns for qualitative assessment and reasoning

ユースケース

01Comparing performance between different LLM models or prompt versions

02Iteratively refining prompt templates to improve classification accuracy

03Monitoring RAG pipeline quality by assessing factuality and relevance

What are Skills?·How to Install

Install with 🐟 Skill.Fish

npx skillfish add atrawog/overthink-plugins evaluation

For use in Claude.ai and ChatGPT

Download Skill