Can I define custom quality metrics?

Yes, you can implement LLM-as-Judge evaluators to score responses based on domain-specific requirements like brand tone, technical accuracy, or regulatory compliance.
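As a rough illustration of the LLM-as-Judge pattern mentioned above, the sketch below scores a response against a brand-tone rubric. It is not the skill's actual API: `JUDGE_PROMPT`, `Verdict`, `judge_response`, and the `call_model` callable are hypothetical names, and the judge model is replaced by a stub so the example runs offline.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric prompt; a real evaluator would use its own wording.
JUDGE_PROMPT = """You are grading an AI agent's response for brand tone.
Rubric: friendly, concise, no jargon.
Return JSON: {{"score": <number from 0 to 1>, "reason": "<one sentence>"}}

User request: {request}
Agent response: {response}
"""


@dataclass
class Verdict:
    score: float
    reason: str


def judge_response(request: str, response: str,
                   call_model: Callable[[str], str]) -> Verdict:
    """Ask a judge model to grade one response and parse its JSON verdict.

    `call_model` is an assumption: any function that sends a prompt to an
    LLM and returns the raw text reply.
    """
    prompt = JUDGE_PROMPT.format(request=request, response=response)
    raw = call_model(prompt)
    # Judges often wrap JSON in extra prose, so extract the first {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("judge did not return JSON")
    data = json.loads(match.group(0))
    # Clamp the score so a misbehaving judge cannot leave the 0..1 range.
    score = min(1.0, max(0.0, float(data["score"])))
    return Verdict(score=score, reason=str(data.get("reason", "")))


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return 'Sure. {"score": 0.8, "reason": "Friendly and short."}'


verdict = judge_response("Where is my order?", "It ships tomorrow!", fake_judge)
print(verdict)
```

Passing the model call in as a function keeps the rubric and parsing logic testable without network access; swapping `fake_judge` for a real client is the only change needed to grade live responses.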

Does this skill support production monitoring?

Yes. It supports both pre-production batch testing and continuous production monitoring, and it integrates with CloudWatch for proactive quality alerts.
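One way such alerting could be wired up, sketched under assumptions: the helper below builds the arguments for a CloudWatch alarm that fires when an average quality score drops below a threshold. The namespace `Example/AgentQuality`, the metric name, and the helper `build_quality_alarm` are hypothetical; how the skill actually publishes its scores is not stated here. The dictionary keys are standard parameters of the boto3 CloudWatch `put_metric_alarm` call.

```python
def build_quality_alarm(metric_name: str, threshold: float,
                        namespace: str = "Example/AgentQuality") -> dict:
    """Build keyword arguments for a low-score CloudWatch alarm.

    The namespace and metric name are placeholders; use whatever names
    your evaluation scores are actually published under.
    """
    return {
        "AlarmName": f"{metric_name}-below-{threshold}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 300,                 # evaluate in 5-minute windows
        "EvaluationPeriods": 3,        # require 3 consecutive low windows
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not a failure
    }


alarm = build_quality_alarm("Correctness", 0.7)
print(alarm["AlarmName"])

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring several consecutive low windows trades alert latency for fewer false alarms from a single bad interaction.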

How does this differ from Bedrock Guardrails?

Bedrock Guardrails are primarily used for real-time content filtering and policy enforcement, whereas AgentCore Evaluations are used to measure, score, and monitor the overall quality and effectiveness of agent interactions.

What are the built-in evaluators available in this skill?

The skill includes 13 standardized evaluators covering Correctness, Helpfulness, Tool Selection Accuracy, Tool Parameter Accuracy, Safety, Faithfulness, Goal Success Rate, Context Relevance, Coherence, Conciseness, Stereotype Harm, Maliciousness, and Self-Harm.

Bedrock AgentCore Evaluations

Name: Bedrock AgentCore Evaluations
Author: adaptationio

Security and Testing

Implements metric-based quality assurance and monitoring for Amazon Bedrock AI agents using built-in and custom evaluators.

Bedrock AgentCore Evaluations moves AI agent development from subjective assessment to metric-based quality assurance. It provides 13 standardized evaluators for dimensions like correctness, safety, and tool accuracy, and it supports custom LLM-as-Judge patterns for domain-specific metrics such as brand tone or regulatory compliance. Whether you are testing agents before production deployment or monitoring live interactions via CloudWatch, the skill uses quantifiable scoring and proactive alerting to help keep agent behavior safe, effective, and aligned with organizational standards.

Key Features

01 Custom LLM-as-Judge patterns for domain-specific quality metrics

02 Quantitative scoring for goal success, faithfulness, and coherence

03 13 built-in evaluators for correctness, safety, and tool accuracy

04 Continuous production monitoring with CloudWatch alerting

05 0 GitHub stars

06 Pre-production on-demand and batch testing capabilities

Use Cases

01 Validating agent tool selection and parameter accuracy before deployment

02 Monitoring live production interactions for safety and compliance violations

03 Establishing baseline performance metrics for complex AI agent workflows

Install with 🐟 Skill.Fish

npx skillfish add adaptationio/skrillz bedrock-agentcore-evaluations

For use in Claude.ai and ChatGPT
