About
The Evaluation Metrics skill gives AI engineers a comprehensive toolkit for assessing LLM outputs objectively with industry-standard frameworks such as RAGAS and LangChain. It supports systematic quality tracking across several dimensions, including semantic similarity (BERTScore), RAG-specific quality (faithfulness, answer relevance), and automated hallucination detection. By integrating benchmark suites such as MMLU and HumanEval, it lets developers validate model improvements during A/B testing and verify through statistical tests that production systems meet quality standards.
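
The snippet below is a minimal sketch of how these metrics are typically computed, assuming the `bert-score` package and the ragas 0.1-style `evaluate()` API (newer ragas releases use a different dataset interface); the question, answer, and context strings are hypothetical examples, and the RAGAS metrics require an LLM judge, e.g. an `OPENAI_API_KEY` in the environment.

```python
# Sketch: score one RAG answer for semantic similarity and RAG-specific quality.
from bert_score import score as bert_score
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Hypothetical evaluation example.
question = "What is the capital of France?"
answer = "The capital of France is Paris."
reference = "Paris is the capital of France."
contexts = ["Paris has been the capital of France since 987 AD."]

# Semantic similarity: BERTScore precision/recall/F1 against the reference answer.
P, R, F1 = bert_score([answer], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

# RAG-specific quality: faithfulness to the retrieved context and
# relevance of the answer to the question, judged by an LLM via RAGAS.
data = Dataset.from_dict({
    "question": [question],
    "answer": [answer],
    "contexts": [contexts],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for faithfulness and answer_relevancy
```

A low faithfulness score flags statements not supported by the retrieved context, which is how automated hallucination detection is usually surfaced from these metrics.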
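
For A/B testing, the statistical verification step can be as simple as a significance test over per-example scores. The sketch below assumes each list holds one quality score (e.g. faithfulness) per evaluation example, with both model variants scored on the same examples; the numbers are hypothetical.

```python
# Sketch: check whether a candidate model's uplift over the baseline is statistically significant.
from scipy import stats

scores_a = [0.82, 0.91, 0.78, 0.88, 0.85, 0.90, 0.79, 0.84]  # baseline model (hypothetical)
scores_b = [0.89, 0.93, 0.85, 0.92, 0.88, 0.95, 0.86, 0.90]  # candidate model (hypothetical)

# Paired t-test, since both variants are evaluated on the same examples.
t_stat, p_value = stats.ttest_rel(scores_b, scores_a)
uplift = sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a)
print(f"mean uplift: {uplift:.3f}")
print(f"p-value: {p_value:.4f}")  # treat the uplift as real only if p is small, e.g. < 0.05
```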