What automated metrics are supported by this skill?

The skill provides implementations for standard NLP metrics including BLEU, ROUGE, and METEOR, as well as advanced embedding-based metrics like BERTScore for measuring semantic similarity.
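As a rough illustration of what an overlap metric such as ROUGE measures, here is a minimal sketch of a ROUGE-1 style unigram F1 score. It is not the skill's own implementation: the function name `rouge1_f1` is hypothetical, and the sketch assumes lowercased whitespace tokenization with no stemming, so its scores will differ from those of established ROUGE packages.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate text."""
    # Assumption: lowercased whitespace tokens, no stemming or stopword removal.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat")
```

Embedding-based metrics such as BERTScore replace the exact token match with similarity between contextual embeddings, which is why they can credit paraphrases that this kind of overlap score misses.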

Does it support regression testing for AI models?

Absolutely. It includes a RegressionDetector framework designed to compare new model outputs against a baseline and flag statistically significant drops in performance.
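The listing does not show how RegressionDetector is implemented, so the following is only a sketch of the general idea: compare per-example scores for a baseline and a candidate on the same test cases and flag a drop that is unlikely to be noise. The function name `flag_regression`, its parameters, and the choice of a paired sign-flip permutation test are all assumptions made for illustration.

```python
import random
from statistics import mean


def flag_regression(baseline, candidate, alpha=0.05, n_resamples=10_000, seed=0):
    """Paired sign-flip permutation test for a drop in mean score.

    baseline and candidate hold per-example scores for the same test cases.
    Returns (is_regression, mean_drop, p_value).
    """
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be paired, one entry per test case")
    # Positive difference means the candidate scored lower than the baseline.
    diffs = [b - c for b, c in zip(baseline, candidate)]
    observed = mean(diffs)
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled >= observed:
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed > 0 and p_value < alpha, observed, p_value


baseline_scores = [0.80 + 0.01 * i for i in range(20)]
candidate_scores = [s - 0.10 for s in baseline_scores]
is_regression, drop, p_value = flag_regression(baseline_scores, candidate_scores)
```

A permutation test is used here because it needs only the standard library and makes no normality assumption; a paired t-test would be a reasonable alternative when the score differences are roughly normal.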

Can I use this for RAG (Retrieval-Augmented Generation) apps?

Yes, the skill includes specialized patterns for RAG evaluation, including measuring groundedness (faithfulness to context) and retrieval metrics like MRR and NDCG.
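The two retrieval metrics named above can be sketched in a few lines. The function names and input shapes below are assumptions for illustration, not the skill's API. The NDCG variant uses linear gain and computes the ideal ordering from the retrieved list only, whereas a full evaluation would rank all known relevant documents.

```python
import math


def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 relevance flags in ranked order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """gains: graded relevance of the retrieved documents in ranked order."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    # Assumption: the ideal ranking is the retrieved list sorted by gain.
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
ndcg = ndcg_at_k([3, 2, 1], k=3)
```

MRR rewards placing the first relevant document early, while NDCG also accounts for graded relevance across the whole top-k list.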

How does the LLM-as-Judge pattern work?

It uses a high-capability model (such as Claude 3.5 Sonnet or GPT-4) to grade the outputs of other models against a defined rubric with criteria such as accuracy, helpfulness, and clarity.
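A pointwise version of this pattern might look like the sketch below. Everything in it is illustrative: `build_judge_prompt`, `judge`, the 1 to 5 scale, and the JSON reply format are assumptions, and `call_model` is a hypothetical hook that you would wire to whichever judge model you use. A canned `fake_judge` stands in for the real API call so the example runs offline.

```python
import json
from typing import Callable

RUBRIC = ("accuracy", "helpfulness", "clarity")


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a grading prompt that asks for one score per rubric criterion."""
    criteria = ", ".join(RUBRIC)
    return (
        "You are grading an assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Score each criterion ({criteria}) from 1 to 5 and reply with JSON only, "
        'for example {"accuracy": 4, "helpfulness": 5, "clarity": 3}.'
    )


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """call_model is a hypothetical hook that sends a prompt to the judge model."""
    raw = call_model(build_judge_prompt(question, answer))
    scores = json.loads(raw)
    # Validate the reply so malformed judge output fails loudly.
    for name in RUBRIC:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
    return {name: scores[name] for name in RUBRIC}


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same scores.
    return '{"accuracy": 4, "helpfulness": 5, "clarity": 3}'


result = judge("What is the capital of France?", "Paris.", fake_judge)
```

Asking for structured output and validating it is a deliberate choice here: judge models occasionally return prose or out-of-range values, and silent parsing errors would skew aggregate scores.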

LLM Evaluation Framework

Author: Tahir-yamin


Data Science and ML

Implements systematic evaluation strategies for LLM applications using automated metrics, human feedback, and comparative benchmarking.

The LLM Evaluation skill provides a comprehensive toolkit for assessing the performance, quality, and reliability of Large Language Model applications. It equips developers with the methodologies needed to move beyond 'vibes-based' testing by calculating automated metrics like BERTScore and ROUGE, implementing LLM-as-judge patterns, and conducting rigorous A/B testing. Whether you are validating prompt engineering changes, benchmarking different foundation models, or building a RAG pipeline, this skill provides the statistical framework and code patterns required to establish production-grade quality baselines and detect performance regressions.

Key Features

01. 3 GitHub stars
02. Human evaluation frameworks with inter-rater agreement (Cohen's Kappa)
03. LLM-as-Judge patterns for pointwise and pairwise assessment
04. RAG-specific metrics for groundedness, retrieval, and factuality
05. Statistical A/B testing with t-tests and effect size calculations
06. Automated text generation metrics (BLEU, ROUGE, BERTScore)
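Two of the statistical ideas in the list above, inter-rater agreement and A/B comparison with an effect size, can be sketched with the standard library alone. The function names are hypothetical and this is not the skill's code. The second function returns only Welch's t statistic and Cohen's d; turning t into a p-value needs the t distribution, which a statistics package would supply.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters' categorical labels, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def welch_t_and_cohens_d(scores_a, scores_b):
    """Welch's t statistic and Cohen's d (pooled SD) for two variants' scores.

    Assumes each group has at least two scores and nonzero variance.
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    var_a, var_b = variance(scores_a), variance(scores_b)  # sample variances
    n_a, n_b = len(scores_a), len(scores_b)
    t_stat = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return t_stat, (mean_a - mean_b) / pooled_sd


kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
t_stat, d = welch_t_and_cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Reporting the effect size alongside the test statistic matters because a large sample can make a negligible difference statistically significant.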

Use Cases

01. Validating the impact of system prompt changes on application accuracy and safety
02. Establishing continuous integration (CI) tests to flag performance regressions before deployment
03. Comparing model performance and cost-efficiency when migrating between LLM providers


Install with 🐟 Skill.Fish

npx skillfish add tahir-yamin/dev-engineering-playbook llm-evaluation

A downloadable version of the skill is available for use in Claude.ai and ChatGPT.