About
Bedrock AgentCore Evaluations moves AI agent development from subjective assessment to rigorous, metric-based quality assurance. It provides 13 standardized evaluators for dimensions like correctness, safety, and tool accuracy, while supporting custom LLM-as-Judge patterns for domain-specific metrics such as brand tone or regulatory compliance. Whether testing agents before production deployment or monitoring live interactions via CloudWatch, this skill helps ensure agent behavior remains safe, effective, and aligned with organizational standards through quantifiable scoring and proactive alerting.
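As a rough illustration of the custom LLM-as-Judge pattern mentioned above, the sketch below scores a single agent response for brand tone using the Bedrock Converse API. The rubric, model ID, 0-5 scale, and function names are placeholder assumptions for illustration; they are not part of the AgentCore Evaluations API itself.

```python
# Illustrative sketch only: a custom LLM-as-Judge metric for brand tone.
# The rubric, model ID, and 0-5 scale are assumptions, not AgentCore Evaluations APIs.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = """You are an evaluator. Rate the assistant response below for brand tone
(friendly, concise, no jargon) on a 0-5 scale.
Reply with JSON only: {{"score": <int>, "reason": "<text>"}}.

Response to evaluate:
{response}"""

def judge_brand_tone(agent_response: str,
                     model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    """Ask a judge model to score one agent response; returns {"score": ..., "reason": ...}."""
    result = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": JUDGE_PROMPT.format(response=agent_response)}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0.0},
    )
    return json.loads(result["output"]["message"]["content"][0]["text"])

if __name__ == "__main__":
    print(judge_brand_tone("Hey there! Your refund is on its way and should land in 3-5 days."))
```

The same shape works for any domain-specific metric: swap the rubric in the prompt, keep the structured JSON output, and feed the resulting scores into whatever reporting or alerting pipeline (for example, CloudWatch metrics) you already use.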