Can I use custom tasks for the integration test?

Yes. You can either supply your own task prompt or use the auto-generation tool to create a task relevant to your repository.

What happens if the agent hits a sandbox restriction?

If the agentic loop detects that tools were blocked by the sandbox, it flags the run as requiring escalation so that you can re-run the test with the necessary permissions.
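
The escalation check described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the `ToolCall` record, its `blocked_by_sandbox` field, and the `needs_escalation` function are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in an agentic run."""
    tool: str
    blocked_by_sandbox: bool = False


def needs_escalation(calls: List[ToolCall]) -> List[str]:
    """Return the sorted, de-duplicated names of tools the sandbox blocked."""
    return sorted({c.tool for c in calls if c.blocked_by_sandbox})


if __name__ == "__main__":
    # Illustrative data only; a real run would supply its own records.
    run = [ToolCall("read_file"), ToolCall("network_fetch", blocked_by_sandbox=True)]
    blocked = needs_escalation(run)
    if blocked:
        print("Escalation required for:", ", ".join(blocked))
```

An empty result means no tool was blocked, so the run would not need to be repeated with broader permissions.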

Does this skill require the Codex CLI?

Yes. The Codex CLI must be installed and authenticated in your local environment to run the agentic loops and evaluations.
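
A quick pre-flight check can confirm that the CLI is reachable before starting a run. The sketch below assumes the executable is named `codex`; that name and the `find_cli` helper are assumptions, not something this page states. It only checks that the executable is on `PATH` and does not verify authentication.

```python
import shutil
from typing import Optional


def find_cli(name: str = "codex") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        print("Codex CLI not found on PATH; install it before running the skill.")
    else:
        print(f"Codex CLI found at {path}")
```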

How are the test results presented?

Results are saved to a timestamped directory containing JSON summaries, execution logs, deterministic check reports, and human-readable summaries.
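
To inspect the output programmatically, one option is to pick the newest run directory and load its JSON files. This is a sketch under assumptions: the page does not state the directory layout or file names, so the code assumes a hypothetical results root whose run directories have names that sort chronologically, and it loads whatever `*.json` files it finds rather than expecting specific ones.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def latest_run_dir(results_root: Path) -> Optional[Path]:
    """Pick the most recent run directory, assuming names sort chronologically."""
    runs = sorted(p for p in results_root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_json_summaries(run_dir: Path) -> Dict[str, Any]:
    """Load every *.json file directly inside run_dir, keyed by file name."""
    return {
        p.name: json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(run_dir.glob("*.json"))
    }


# Example usage (the "results" path is a placeholder, not a documented location):
#   run = latest_run_dir(Path("results"))
#   if run is not None:
#       for name, data in load_json_summaries(run).items():
#           print(name, type(data).__name__)
```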

What does the Codex Readiness Integration Test evaluate?

It evaluates the quality of agentic execution: whether the agent can make real filesystem edits, pass build/test plans, and use repository context effectively.

Codex Readiness Integration Test

Author: x-cmd

Security and Testing

Executes multi-stage integration tests to validate the performance and execution quality of agentic AI coding loops.

The Codex Readiness Integration Test skill provides a comprehensive framework for benchmarking and validating AI agents within a real-world development environment. It orchestrates a complete workflow including task generation, agentic execution via the Codex CLI, and automated evaluation using both deterministic build-test checks and qualitative LLM scoring. By capturing detailed evidence and providing structured feedback, this skill helps developers ensure their agentic tools are reliable, context-aware, and capable of producing production-grade code modifications.

Key Features

01 Detailed reporting including agentic summaries, logs, and evaluation results

02 End-to-end agentic loop execution with build and test scoring

03 Deterministic verification of filesystem edits and git diffs

04 GitHub stars: 9

05 Automated task generation with human-in-the-loop prompt approval

06 Qualitative LLM-based evaluation of context usage and execution quality

Use Cases

01 Benchmarking the reliability of AI agents on complex repository-level tasks

02 Validating automated code modification workflows before production deployment

03 Generating structured performance data for AI-driven development tools

Install with 🐟 Skill.Fish

npx skillfish add x-cmd/skill codex-readiness-integration-test
