What is a judge agent in the context of tdx agent-test?

A judge agent is an LLM configured to evaluate your agent's responses against the criteria you've defined in your test.yml file, providing a binary pass/fail result based on your requirements.

Can I test conversations that span multiple turns?

Yes. The skill supports multi-round tests: you define consecutive user inputs and set criteria for each individual response in the sequence, which lets you check that the agent keeps context and memory across turns.
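
To make the idea of per-round criteria concrete, here is a minimal Python sketch of how a multi-round test could be modeled and run. It is not the tdx implementation and does not reflect the real test.yml schema: `Round`, `MultiRoundTest`, `run_rounds`, and the `agent` and `judge` callables are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (speaker, text) pairs, oldest first


@dataclass
class Round:
    """One turn of a test: a user input plus the criteria for the reply."""
    user_input: str
    criteria: List[str]


@dataclass
class MultiRoundTest:
    """A named sequence of rounds (hypothetical structure, not the tdx schema)."""
    name: str
    rounds: List[Round] = field(default_factory=list)


def run_rounds(
    test: MultiRoundTest,
    agent: Callable[[History], str],
    judge: Callable[[str, str], bool],
) -> List[bool]:
    """Send each user input in order, carrying the history forward,
    and judge every reply against that round's own criteria."""
    history: History = []
    results: List[bool] = []
    for rnd in test.rounds:
        history.append(("user", rnd.user_input))
        reply = agent(history)
        history.append(("agent", reply))
        # A round passes only if the judge accepts every criterion.
        results.append(all(judge(reply, c) for c in rnd.criteria))
    return results
```

Because the full history is passed to the agent on every turn, a later round can only pass if the agent actually uses what was said earlier, which is the property a memory test is meant to expose.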

Where should I store my test files?

Create a test.yml file in your agent's directory, typically alongside your agent.yml and prompt.md files.

How does the re-evaluation workflow save time?

Using the --reeval flag, you can update your testing criteria and re-run the evaluation against cached conversations without needing to generate new, time-consuming LLM responses.
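
The time saving comes from separating response generation from judging. The following Python sketch illustrates that separation under stated assumptions: `reevaluate`, the shape of the cache (test name mapped to a stored reply), and the `judge` callable are hypothetical and do not describe how tdx stores or reads its cached conversations.

```python
from typing import Callable, Dict, List


def reevaluate(
    cached_replies: Dict[str, str],
    criteria: Dict[str, List[str]],
    judge: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Judge stored replies against (possibly updated) criteria.

    The agent is never called here: only the judge runs, so changing
    the criteria does not require generating new responses.
    A test with no criteria passes vacuously in this sketch.
    """
    return {
        name: all(judge(reply, c) for c in criteria.get(name, []))
        for name, reply in cached_replies.items()
    }
```

Calling `reevaluate` twice with the same cache and two different criteria dictionaries mirrors the loop of editing criteria and re-running the evaluation.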

Can I filter which tests to run?

Yes, you can use the --name flag to target specific tests or the --tags flag to run groups of tests categorized by labels like 'smoke' or 'regression'.
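
As an illustration of name and tag filtering, the Python sketch below selects tests from an in-memory list. The function `select_tests` and the dictionary keys `name` and `tags` are hypothetical, and the matching rule (a test is kept if it shares at least one requested tag) is an assumption rather than documented tdx behavior.

```python
from typing import Dict, Iterable, List, Optional


def select_tests(
    tests: List[Dict],
    name: Optional[str] = None,
    tags: Optional[Iterable[str]] = None,
) -> List[Dict]:
    """Keep tests matching the given name and sharing at least one tag.

    Omitting a filter (None) disables it, so calling with no filters
    returns every test.
    """
    wanted = set(tags) if tags else set()
    selected = []
    for test in tests:
        if name is not None and test["name"] != name:
            continue
        # Assumption: any overlap with the requested tags is enough.
        if wanted and not wanted & set(test.get("tags", [])):
            continue
        selected.append(test)
    return selected
```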

Agent Testing Utility

Author: treasure-data


Security and Testing

Automates the testing and evaluation of LLM agents using YAML-defined scenarios and AI-powered judge criteria.

About

The agent-test skill provides a framework for validating LLM agent behavior within the Treasure Data ecosystem. Developers define interaction scenarios in YAML, including multi-round ones, and a judge agent evaluates each response against specific, measurable criteria. This supports regression testing, prompt refinement, and checking for consistent behavior across diverse user inputs without manual review of every response, which shortens the agent development cycle.

Key Features

Granular test filtering using tags and specific name-based execution.
Dry-run and no-eval modes for syntax validation and conversation logging.
Efficient re-evaluation workflow to iterate on criteria without re-running LLM calls.
Automated YAML-based test definitions for single and multi-round conversations.
AI-powered judge agent for objective binary pass/fail evaluation of responses.

Use Cases

Validating multi-step conversational flows and memory retention in complex agents.
Performing regression testing on agents after updating system prompts or data sources.
Standardizing the quality assurance process for LLM-based applications and workflows.


Install with 🐟 Skill.Fish

npx skillfish add treasure-data/td-skills agent-test
