How does it handle RAG performance measurement?

It provides metrics for faithfulness, answer relevance, and context precision, so you can verify that your RAG system's answers are accurate and grounded in the retrieved data.
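The listing does not show how the skill computes these metrics, so the following is only a minimal sketch of the definitions commonly used for two of them. It assumes the per-claim and per-chunk judgments (the boolean lists) have already been produced by a human reviewer or an LLM judge; the function names are illustrative and are not the skill's API.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks.

    chunk_relevant[i] is True when the chunk at rank i + 1 was judged relevant.
    Relevant chunks that appear earlier in the ranking raise the score.
    """
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(chunk_relevant, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / hits if hits else 0.0


# Hypothetical judgments for one question.
print(faithfulness([True, True, False, True]))
print(context_precision([True, False, True]))
```

Both scores fall between 0 and 1. A low faithfulness score with a high context precision score suggests the retriever found the right material but the generator drifted away from it.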

Does it support standard AI benchmarks?

Yes, it includes implementation templates for popular benchmarks like MMLU (Massive Multitask Language Understanding) and HumanEval for code generation accuracy.
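The templates themselves are not reproduced on this page. As a rough sketch of the scoring conventions usually applied to these two benchmarks, the snippet below computes multiple-choice accuracy (the usual MMLU score) and the unbiased pass@k estimator introduced with HumanEval. The function names and sample values are assumptions for illustration.

```python
from math import comb


def multiple_choice_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Share of questions where the predicted option letter matches the key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(
        p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"]))
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level pass@k is the mean of the per-problem values.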

Can this skill help detect AI hallucinations?

Yes, it includes specialized tools for hallucination detection that verify claims against provided source contexts and perform self-consistency checks across multiple generations.
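One simple way to picture a self-consistency check is to sample the same prompt several times and measure how often the generations agree. The sketch below assumes short, directly comparable answers and uses exact matching after normalization; it is not the skill's implementation. Longer free-text answers would need a semantic comparison instead.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.lower().split())


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of samples agreeing with it."""
    if not answers:
        raise ValueError("need at least one generation")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical generations for the same prompt.
samples = ["Paris", "paris ", "Lyon", "Paris", "Marseille"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

A low agreement rate can be treated as a signal to review the answer. The cutoff is a project-specific choice and is not specified by the source.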

What frameworks does the Evaluation Metrics skill support?

It supports major industry frameworks including RAGAS for retrieval-augmented generation, LangChain evaluators, and standard text metrics like BLEU, ROUGE, and BERTScore.
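To make the text metrics more concrete, here is a dependency-free sketch of ROUGE-L, which scores a candidate by its longest common subsequence (LCS) with a reference. It assumes plain whitespace tokenization and lowercasing. Published ROUGE packages add stemming and other preprocessing, so their numbers can differ from this simplified version.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

BLEU and BERTScore follow the same candidate-versus-reference pattern, but BLEU counts n-gram overlap and BERTScore compares contextual embeddings, which requires a pretrained model.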

LLM Evaluation Metrics

Name: LLM Evaluation Metrics
Author: pluginagentmarketplace

Data Science & ML

Measures and optimizes Large Language Model performance through systematic quality frameworks, benchmarks, and hallucination detection.

The Evaluation Metrics skill gives AI engineers a toolkit for assessing LLM outputs objectively with industry-standard frameworks such as RAGAS and LangChain. It tracks quality along several dimensions: semantic similarity (BERTScore), RAG-specific metrics (faithfulness, relevance), and automated hallucination detection. With benchmark suites such as MMLU and HumanEval, developers can validate model improvements during A/B testing and use statistical tests to confirm that production systems meet their quality thresholds.

Key Features

01 Automated hallucination detection and self-consistency verification

02 2 GitHub stars

03 Semantic evaluation tools including BERTScore, BLEU, and ROUGE-L

04 Integrated RAGAS support for faithfulness and context precision metrics

05 Statistical A/B testing framework for comparing model versions

06 Standardized benchmark suites for MMLU and HumanEval coding tasks
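The listing does not say which statistical test the A/B framework uses. As one possible approach, the sketch below runs a paired sign-flip permutation test on per-example scores from two model versions evaluated on the same inputs. The function name, the resample count, and the example scores are assumptions.

```python
import random


def paired_permutation_test(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean score difference between paired samples."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    # The +1 correction keeps the estimate away from an impossible p-value of 0.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-example scores for a base model (A) and a fine-tuned model (B).
model_a = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.68]
model_b = [0.71, 0.74, 0.63, 0.85, 0.70, 0.80, 0.65, 0.77]
print(paired_permutation_test(model_a, model_b))
```

A small p-value suggests the gap between versions is unlikely to come from chance alone. The significance level to apply is a decision for the team running the comparison.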

Use Cases

01 Benchmarking fine-tuned models against base models using MMLU standards

02 Implementing automated quality gates in AI development CI/CD pipelines

03 Verifying RAG pipeline accuracy and grounding before production deployment
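A quality gate of the kind described in the second use case can be as small as a threshold check whose result becomes the CI job's exit code. The sketch below is one way to do it. The metric names match those mentioned on this page, but the threshold values and function names are hypothetical.

```python
# Hypothetical minimum scores; tune these for your own project.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
}


def check_quality_gate(
    metrics: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate_exit_code(metrics: dict[str, float]) -> int:
    """Print any failures and return 0 (pass) or 1 (fail) for the CI runner."""
    failures = check_quality_gate(metrics, THRESHOLDS)
    for message in failures:
        print(f"QUALITY GATE FAILED: {message}")
    return 1 if failures else 0


# Hypothetical evaluation results from the latest build.
print(gate_exit_code({"faithfulness": 0.91, "answer_relevance": 0.78}))
```

In a pipeline step, passing the returned value to `sys.exit` makes the job fail whenever a metric regresses below its minimum.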

Install with 🐟 Skill.Fish

npx skillfish add pluginagentmarketplace/custom-plugin-ai-engineer evaluation-metrics

A downloadable version of the skill is available for use in Claude.ai and ChatGPT.