About
The llm-evaluation skill provides a comprehensive framework for measuring and improving the quality of AI applications. It bridges the gap between raw model output and production-ready performance through systematic evaluation strategies: automated n-gram and embedding metrics, LLM-as-judge patterns for semantic validation, and human-in-the-loop annotation structures. Whether you are benchmarking different models, detecting performance regressions, or validating RAG pipelines, this skill gives you the statistical and programmatic tools to establish reliable baselines and keep model outputs accurate, safe, and helpful over time.
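
As a rough illustration of the automated-metric side of this workflow, the sketch below scores candidate outputs against gold references with a simple n-gram precision check and aggregates a pass rate against a baseline threshold. The function names (`ngram_precision`, `evaluate`), the threshold, and the toy cases are illustrative placeholders, not part of the skill's actual API.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / sum(cand.values())


def evaluate(cases: list[dict], threshold: float = 0.5) -> dict:
    """Score each case and report the mean score and pass rate against a hypothetical baseline."""
    scores = [ngram_precision(c["output"], c["reference"]) for c in cases]
    passed = sum(score >= threshold for score in scores)
    return {"mean_score": sum(scores) / len(scores), "pass_rate": passed / len(cases)}


if __name__ == "__main__":
    # Toy cases standing in for real model outputs and gold references.
    cases = [
        {"output": "The capital of France is Paris.",
         "reference": "Paris is the capital of France."},
        {"output": "The Eiffel Tower is in Berlin.",
         "reference": "The Eiffel Tower is in Paris."},
    ]
    print(evaluate(cases))
```

In practice, lexical metrics like this are only a first filter; the same `evaluate` loop can be pointed at embedding similarity or an LLM-as-judge scorer when surface overlap is too blunt a measure.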