Does this skill support structured output validation?

Yes, it includes specific patterns for validating JSON schemas and ensuring that LLM responses match your application's expected data structure.
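As a rough illustration of what such a check can look like, here is a minimal sketch using only the Python standard library. It is not the skill's own pattern, which is not shown on this page. The `TicketSummary` shape and the `parse_ticket_summary` helper are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class TicketSummary:
    """Hypothetical structure the application expects from the LLM."""
    title: str
    priority: int
    tags: list


def parse_ticket_summary(raw: str) -> TicketSummary:
    """Parse an LLM response and reject anything that does not match the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    expected = {"title": str, "priority": int, "tags": list}
    missing = expected.keys() - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    unexpected = data.keys() - expected.keys()
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")

    for name, typ in expected.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(data[name], bool) or not isinstance(data[name], typ):
            raise ValueError(f"field {name!r} must be of type {typ.__name__}")
    return TicketSummary(**data)
```

A test can then feed canned response strings to `parse_ticket_summary` and assert that well-formed ones parse and malformed ones raise `ValueError`. In a real project a schema library would likely replace the hand-written checks.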

What are the recommended quality metrics for AI applications?

This skill prioritizes Answer Relevancy (≥0.7), Faithfulness (≥0.8), low Hallucination scores (≤0.3), and high Context Precision (≥0.7) to ensure high-quality outputs.
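One way to apply these thresholds in a test suite is a small gate function. The sketch below assumes the scores have already been computed by an evaluation framework and arrive as a plain dictionary. The metric key names and the `failing_metrics` helper are illustrative, not part of DeepEval or RAGAS.

```python
# Thresholds taken from the answer above; hallucination is an upper bound,
# the other three are lower bounds.
MIN_THRESHOLDS = {
    "answer_relevancy": 0.7,
    "faithfulness": 0.8,
    "context_precision": 0.7,
}
MAX_THRESHOLDS = {"hallucination": 0.3}


def failing_metrics(scores: dict) -> list:
    """Return the names of metrics that miss their threshold, sorted alphabetically."""
    failures = [
        name for name, minimum in MIN_THRESHOLDS.items() if scores[name] < minimum
    ]
    failures += [
        name for name, maximum in MAX_THRESHOLDS.items() if scores[name] > maximum
    ]
    return sorted(failures)
```

A test would assert that `failing_metrics(scores)` is empty. A missing metric raises `KeyError` in this sketch, so an incomplete evaluation fails loudly rather than passing silently.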

Can I use this for RAG (Retrieval-Augmented Generation) applications?

Absolutely. It integrates with RAGAS and DeepEval specifically to measure retrieval quality, context usage, and the accuracy of generated answers based on provided context.

Why shouldn't I test against live LLM APIs in CI?

Testing against live APIs is slow, expensive, and non-deterministic, which often leads to flaky CI/CD pipelines and unpredictable costs. Mocking responses or replaying VCR.py recordings gives faster, more reliable results.
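A minimal sketch of the mocking approach, using `unittest.mock` from the standard library. The `summarize` function and the client's `complete` method are hypothetical stand-ins for application code and an LLM SDK. They are not the API of any particular provider.

```python
from unittest.mock import Mock


def summarize(client, text: str) -> str:
    """Hypothetical application code: ask an LLM client for a summary and trim it."""
    reply = client.complete(prompt=f"Summarize: {text}")
    return reply.strip()


def test_summarize_uses_mocked_client():
    # The mock replaces the network call with a fixed, deterministic reply.
    client = Mock()
    client.complete.return_value = "  A short summary.  "

    assert summarize(client, "long document") == "A short summary."
    # Verify the prompt that would have been sent to the live API.
    client.complete.assert_called_once_with(prompt="Summarize: long document")
```

Because the reply is fixed, the test runs in milliseconds, costs nothing, and produces the same result on every CI run.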

AI & LLM Testing Patterns

Author: yonatangross

Security & Testing

Implements deterministic testing patterns for AI applications using DeepEval, RAGAS, and advanced mocking strategies.

This skill provides a comprehensive toolkit for validating LLM-based applications by enforcing industry-standard best practices like response mocking, quality metric evaluation, and asynchronous timeout handling. It enables developers to transition from flaky, expensive live API tests to deterministic unit and integration tests using frameworks like DeepEval and VCR.py. By automating the validation of structured outputs and RAG pipelines, it ensures that AI-driven features meet production-grade standards for accuracy, faithfulness, and reliability.
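The asynchronous timeout handling mentioned above can be exercised without any network access. The following is a sketch under stated assumptions: `slow_llm_call` is a hypothetical stand-in for a real request, and the fallback string is an arbitrary choice for the example.

```python
import asyncio


async def slow_llm_call(delay: float) -> str:
    """Hypothetical stand-in for an LLM request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_timeout(delay: float, timeout: float) -> str:
    """Return the reply, or a fallback string if the call exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(slow_llm_call(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"
```

A test can then drive both branches with short delays, for example `asyncio.run(call_with_timeout(1.0, 0.01))` for the timeout path, so the suite stays fast while still covering the error handling.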

Key Features

01. 29 GitHub stars

02. Structured JSON output and schema verification

03. DeepEval and RAGAS integration for quality metrics (Faithfulness, Relevancy)

04. Deterministic LLM response mocking and stubbing

05. Automated async timeout and error handling validation

06. VCR.py recording for reliable, offline integration tests

Use Cases

01. Validating RAG pipeline accuracy using Faithfulness and Context Precision metrics

02. Creating cost-effective CI/CD pipelines by mocking expensive LLM API calls

03. Ensuring AI-generated JSON outputs strictly adhere to defined application schemas

Install with 🐟 Skill.Fish

npx skillfish add yonatangross/skillforge-claude-plugin llm-testing
