How does it help with regression testing?

It guides you through creating stable golden sets: verified historical input/output pairs that new model outputs are compared against to detect regressions.
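As a rough illustration of the idea, a golden set check can be sketched in plain Python. This is not the skill's own implementation: `GoldenItem`, `find_regressions`, and `stub_model` are hypothetical names, and exact matching after normalization is only one possible comparison rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    """One verified input/expected-output pair (hypothetical structure)."""
    item_id: str
    input_text: str
    expected_output: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting changes are not flagged."""
    return " ".join(text.lower().split())


def find_regressions(golden_set, model_fn):
    """Return ids of golden items whose new output no longer matches."""
    regressions = []
    for item in golden_set:
        new_output = model_fn(item.input_text)
        if normalize(new_output) != normalize(item.expected_output):
            regressions.append(item.item_id)
    return regressions


GOLDEN_SET = [
    GoldenItem("g1", "2 + 2", "4"),
    GoldenItem("g2", "capital of France", "Paris"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; answers g1 correctly and g2 incorrectly."""
    return {"2 + 2": "4", "capital of France": "Lyon"}.get(prompt, "")


print(find_regressions(GOLDEN_SET, stub_model))
```

In practice the golden items would live in a Langfuse dataset rather than a Python list, and free-form outputs usually need a softer comparison than exact matching.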

Can I use this for LLM-as-judge workflows?

Yes, it includes specific steps for creating judge prompts and integrating them with Langfuse's prompt management system.
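A judge prompt is typically a template with placeholders for the evaluated input and output. The sketch below assumes double-brace placeholders and a 1 to 5 scoring scale. `JUDGE_TEMPLATE`, `render`, and `parse_score` are hypothetical local helpers, not part of the skill or the Langfuse SDK.

```python
import re

# Hypothetical judge prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading a model response for {{dimension}}.\n"
    "Question: {{input}}\n"
    "Response: {{output}}\n"
    "Reply with a single integer score from 1 to 5."
)


def render(template: str, **variables: str) -> str:
    """Replace each {{name}} placeholder; raise KeyError if a variable is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return variables[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def parse_score(judge_reply: str) -> int:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    if match is None:
        raise ValueError(f"no score in judge reply: {judge_reply!r}")
    return int(match.group(1))


prompt = render(JUDGE_TEMPLATE, dimension="accuracy", input="2 + 2", output="4")
print(prompt)
```

Storing the template in a prompt management system rather than in code lets the judge wording be versioned independently of the evaluation script.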

Does this require external scripts?

Yes, it uses the dataset_manager.py script from the mberto-compound repository to run dataset creation commands from the CLI.

What evaluation dimensions can I configure?

You can set up dimensions such as accuracy, helpfulness, relevance, safety, tone, and completeness, or define your own custom metrics.
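One way to picture a dimension configuration is as a mapping from dimension names to allowed score ranges. The following is a minimal sketch under stated assumptions: the `Dimension` class, the `build_config` helper, the default 0 to 1 range, and the `citation_quality` custom metric are all hypothetical and do not reflect how the skill stores its settings.

```python
from dataclasses import dataclass

# Dimension names taken from the list above.
BUILT_IN = ("accuracy", "helpfulness", "relevance", "safety", "tone", "completeness")


@dataclass(frozen=True)
class Dimension:
    """A named scoring axis with an allowed numeric range (range is an assumption)."""
    name: str
    min_score: float = 0.0
    max_score: float = 1.0

    def check(self, score: float) -> float:
        """Return the score unchanged if it is inside the range, else raise."""
        if not self.min_score <= score <= self.max_score:
            raise ValueError(
                f"{self.name}: {score} outside [{self.min_score}, {self.max_score}]"
            )
        return score


def build_config(selected, custom=()):
    """Map dimension names to Dimension objects, rejecting unknown built-in names."""
    unknown = [name for name in selected if name not in BUILT_IN]
    if unknown:
        raise ValueError(f"unknown built-in dimensions: {unknown}")
    config = {name: Dimension(name) for name in selected}
    for dim in custom:
        config[dim.name] = dim
    return config


config = build_config(["accuracy", "tone"], custom=[Dimension("citation_quality", 1, 5)])
print(sorted(config))
```

Validating scores against a declared range catches a judge or reviewer that reports on the wrong scale before the value is recorded.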

What is the Langfuse Dataset Setup skill?

It is a Claude Code skill designed to automate the creation and configuration of Langfuse datasets for model testing and evaluation.

Langfuse Dataset Setup

Name: Langfuse Dataset Setup
Author: mberto10

Category: Analytics & Monitoring

Configures comprehensive Langfuse datasets with custom evaluation dimensions and LLM-as-judge prompts.

This skill streamlines setting up evaluation environments in Langfuse, helping developers define regression sets, A/B tests, and golden datasets for AI models. It guides users through requirement gathering, CLI-based dataset creation, and the configuration of scoring mechanisms. Integration with judge prompts enables automated LLM-as-judge workflows, which help AI applications maintain their accuracy, relevance, and safety standards throughout the development lifecycle.

Key Features

01 Supports both human review and automated scoring configurations

02 Facilitates golden set creation for regression and A/B testing

03 Streamlines LLM-as-judge prompt setup and management

04 Automated dataset creation with purpose-driven metadata

05 Configures evaluation dimensions like accuracy, tone, and completeness

GitHub stars: 1

Use Cases

01 Establishing a golden dataset for LLM regression testing

02 Setting up an automated LLM-as-judge framework for production monitoring

03 Configuring A/B testing dimensions for comparing different prompt versions

Install with 🐟 Skill.Fish

npx skillfish add mberto10/mberto-compound langfuse-dataset-setup

The skill is also available as a download for use in Claude.ai and ChatGPT.