Can I use this for RAG applications?

Absolutely. This skill includes specific patterns for RAGAS integration to measure specialized metrics like faithfulness, context precision, and answer relevancy.
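As a rough illustration of how per-sample RAG metrics could become scores on a trace, here is a minimal sketch. It does not call RAGAS or Langfuse. The `metrics_to_scores` helper and the dict-shaped input are hypothetical, and it assumes a metric that failed to compute shows up as `None` or NaN.

```python
import math
from typing import Dict, List, Optional


def metrics_to_scores(trace_id: str, metrics: Dict[str, Optional[float]]) -> List[dict]:
    """Turn one sample's metric results into one score record per metric.

    Hypothetical helper: the input shape is an assumption, not the RAGAS output format.
    """
    scores = []
    for name, value in sorted(metrics.items()):
        # Skip metrics that could not be computed instead of recording a bogus value.
        if value is None or math.isnan(value):
            continue
        scores.append({"trace_id": trace_id, "name": name, "value": float(value)})
    return scores
```

Each returned record carries the trace ID, the metric name, and the numeric value, which is the minimum needed to attach a score to an existing trace.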

What kind of feedback can I capture from users?

You can capture boolean feedback (like thumbs up/down), numerical scores (0.0 to 1.0), and qualitative text comments for both entire traces and specific model generations.
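The three feedback shapes above can be normalized into a single record before being sent anywhere. The sketch below is one way to do that. `FeedbackScore`, `build_feedback`, and the default score name `user-feedback` are hypothetical names, and the actual call that writes the record to Langfuse is left out because it depends on the SDK version in use.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FeedbackScore:
    trace_id: str                          # trace the feedback belongs to
    name: str                              # score name (hypothetical default below)
    value: float                           # always within 0.0 to 1.0
    comment: Optional[str] = None          # qualitative text, if the user left any
    observation_id: Optional[str] = None   # set when scoring one specific generation


def build_feedback(
    trace_id: str,
    rating: Union[bool, int, float],
    comment: Optional[str] = None,
    observation_id: Optional[str] = None,
    name: str = "user-feedback",
) -> FeedbackScore:
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(rating, bool):
        value = 1.0 if rating else 0.0     # thumbs up -> 1.0, thumbs down -> 0.0
    else:
        value = float(rating)
        if not 0.0 <= value <= 1.0:        # also rejects NaN
            raise ValueError(f"rating must be between 0.0 and 1.0, got {rating!r}")
    cleaned = comment.strip() if comment else None
    return FeedbackScore(trace_id, name, value, cleaned or None, observation_id)
```

Leaving `observation_id` unset scores the whole trace; setting it targets a single generation within that trace.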

Do I need to set up tracing before using this skill?

Yes. This skill is a secondary workflow that assumes you have already implemented basic tracing (Workflow A), so you have valid trace IDs to attach scores to.

How does the LLM-as-Judge feature work?

It uses a separate LLM call to evaluate a response against a specific rubric, then automatically writes those results back to Langfuse as structured scores for analysis.
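A minimal sketch of that judge step follows, under stated assumptions. The judge model is passed in as a plain callable (`call_judge`, hypothetical) so the example runs without any provider SDK, and the JSON reply format and the score name `judge-quality` are illustrative choices rather than the skill's own. Writing the result back to Langfuse is not shown.

```python
import json
from typing import Callable, Dict


def judge_response(
    question: str,
    answer: str,
    rubric: str,
    call_judge: Callable[[str], str],   # hypothetical: sends a prompt, returns the judge's raw text
) -> Dict[str, object]:
    """Ask a separate judge model to grade an answer against a rubric."""
    prompt = (
        f"{rubric}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\n"
        'Reply with JSON only: {"score": <number from 0 to 1>, "reasoning": "<short text>"}'
    )
    raw = call_judge(prompt)
    try:
        parsed = json.loads(raw)
        score = float(parsed["score"])
    except (ValueError, KeyError, TypeError) as exc:
        # Judges sometimes reply with prose or malformed JSON; fail loudly instead of guessing.
        raise ValueError(f"judge returned unusable output: {raw!r}") from exc
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    # Structured score record, ready to be attached to a trace by the caller.
    return {"name": "judge-quality", "value": score, "comment": str(parsed.get("reasoning", ""))}
```

Injecting the judge as a callable keeps the parsing and validation logic testable with a stub, independent of which model or provider does the grading.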

Langfuse Evaluation & Scoring

Author: jeremylongshore
GitHub stars: 1,449

Analytics & Monitoring

Implements comprehensive LLM evaluation workflows, including manual scoring, user feedback capture, and automated LLM-as-judge patterns.

This skill provides a standardized framework for monitoring and improving AI model performance with Langfuse. It lets developers integrate manual scoring, capture real-time user feedback, and set up automated evaluation pipelines using LLM-as-judge techniques or RAGAS. By turning raw LLM outputs into quality metrics, it helps teams run regression tests against datasets and build benchmarks for production AI systems.

Key Features

01 RAGAS integration for specialized RAG performance metrics

02 Dataset-based evaluation runs for regression testing

03 Automated LLM-as-Judge patterns for scalable quality monitoring

04 Manual trace and generation scoring for human evaluation

05 1,449 GitHub stars

06 User feedback integration with boolean and categorical ratings

Use Cases

01 Building a human-in-the-loop review system for AI responses

02 Automating quality benchmarks for RAG pipelines using RAGAS

03 Implementing production 'thumbs up/down' feedback loops

Install with 🐟 Skill.Fish

npx skillfish add jeremylongshore/claude-code-plugins-plus-skills langfuse-core-workflow-b

For use in Claude.ai and ChatGPT

