LLM-as-judge (llm-rubric) implementation for qualitative output scoring (see the judge sketch after this list)
Cost-free preview mode using the Echo provider to verify prompt rendering (see the echo-provider sketch below)
Automated prompt benchmarking across multiple providers such as Anthropic and OpenAI (see the benchmarking sketch below)
Custom Python assertion support for specialized metrics and validation logic (see the assertion sketch below)
Comprehensive few-shot example management for complex prompt patterns (see the few-shot sketch below)
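
A minimal sketch of the LLM-as-judge pattern behind an llm-rubric-style check: a second model grades the primary output against a natural-language rubric and returns a score with a pass/fail verdict. `call_model`, the threshold, and the JSON reply format are hypothetical stand-ins, not a specific library's API.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the grading model's API client."""
    raise NotImplementedError

def llm_rubric(output: str, rubric: str, threshold: float = 0.7) -> dict:
    """Ask a judge model to score `output` against a natural-language rubric."""
    grading_prompt = (
        "You are grading a model response against a rubric.\n"
        f"Rubric: {rubric}\n"
        f"Response: {output}\n"
        'Reply with JSON: {"score": <0.0-1.0>, "reason": "<one sentence>"}'
    )
    graded = json.loads(call_model(grading_prompt))
    return {
        "pass": graded["score"] >= threshold,
        "score": graded["score"],
        "reason": graded["reason"],
    }
```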
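
A sketch of how an echo-style provider enables cost-free previews: it returns the fully rendered prompt as the "output", so variable substitution can be inspected without spending tokens. The `EchoProvider` class and the naive `render` helper are illustrative assumptions, not a documented interface.

```python
class EchoProvider:
    """Returns the rendered prompt verbatim; no API call, no tokens billed."""
    def call(self, prompt: str) -> str:
        return prompt

def render(template: str, variables: dict) -> str:
    """Naive {{var}} substitution, standing in for a real template engine."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template

preview = EchoProvider().call(render("Summarize: {{text}}", {"text": "LLMs grade LLMs."}))
print(preview)  # -> "Summarize: LLMs grade LLMs." (verify rendering before spending money)
```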
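
Multi-provider benchmarking reduces to running the same prompt through each configured backend and tabulating assertion results side by side. The `PROVIDERS` registry below is an assumed shape; real clients would call the Anthropic and OpenAI APIs.

```python
from typing import Callable, Dict

# Hypothetical registry mapping provider IDs to client callables.
PROVIDERS: Dict[str, Callable[[str], str]] = {
    "anthropic:claude": lambda p: "...",  # replace with a real API client
    "openai:gpt-4o": lambda p: "...",     # replace with a real API client
}

def benchmark(prompt: str, assertion: Callable[[str], bool]) -> Dict[str, bool]:
    """Run one prompt against every provider and record pass/fail per provider."""
    results = {}
    for provider_id, client in PROVIDERS.items():
        output = client(prompt)
        results[provider_id] = assertion(output)
    return results
```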
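
A sketch of a custom Python assertion. The `get_assert(output, context)` signature and the pass/score/reason result shape follow promptfoo's convention, which these feature names resemble; treating this list as describing that tool is an assumption.

```python
def get_assert(output: str, context: dict) -> dict:
    """Custom metric: pass only if the output is short and mentions the expected topic."""
    topic = context.get("vars", {}).get("topic", "")
    word_count = len(output.split())
    passed = word_count <= 120 and topic.lower() in output.lower()
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"{word_count} words; topic present: {topic.lower() in output.lower()}",
    }
```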
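
Few-shot example management, at its core, stores reusable input/output pairs and splices a chosen subset into the prompt so the model can infer the expected pattern. The data structure and prompt layout below are one plausible arrangement, not a documented format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FewShotExample:
    input: str
    output: str

def build_prompt(instruction: str, examples: List[FewShotExample], query: str) -> str:
    """Prepend worked examples so the model infers the expected pattern."""
    shots = "\n\n".join(f"Input: {ex.input}\nOutput: {ex.output}" for ex in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_prompt(
    "Classify sentiment as positive or negative.",
    [FewShotExample("Great service!", "positive"),
     FewShotExample("Never again.", "negative")],
    "The food was cold.",
)
```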