Can I test benchmarks with specific Claude models?

Yes, the 'test_benchmark.sh' script allows you to specify both the benchmark ID and the model (e.g., claude-haiku-4-5) to perform cost-effective and targeted testing.

How do I fix a benchmark showing a 0% pass rate?

First, run 'check_benchmark.sh' to see if you accidentally used 'prompt' instead of 'task_prompt'. Then, use 'show_full_prompt.sh' to inspect the final text the model receives to identify syntax or instruction gaps.

What is the difference between prompt and task_prompt in this skill?

The 'task_prompt' field appends your specific instructions to the AILANG teaching prompt, whereas 'prompt' replaces it entirely. You should use 'task_prompt' in almost all cases to ensure the model retains knowledge of AILANG syntax.
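The append-versus-replace behaviour described above can be illustrated with a minimal sketch. This is not the skill's actual implementation: the function name `compose_prompt`, the placeholder teaching-prompt text, and the rule that 'prompt' wins when both fields are present are all assumptions made for illustration.

```python
# Hypothetical sketch of how the final prompt might be assembled.
# TEACHING_PROMPT is placeholder text, not the real AILANG teaching prompt.
TEACHING_PROMPT = "You are writing AILANG. (syntax guide would go here)"


def compose_prompt(spec: dict, teaching_prompt: str = TEACHING_PROMPT) -> str:
    """Return the text a model would receive for a benchmark spec."""
    if "prompt" in spec:
        # 'prompt' replaces the teaching prompt entirely, so the model
        # sees no AILANG syntax guidance at all.
        # Assumption: 'prompt' takes precedence if both fields are set.
        return spec["prompt"]
    # 'task_prompt' is appended after the teaching prompt.
    return teaching_prompt + "\n\n" + spec.get("task_prompt", "")


with_task = compose_prompt({"task_prompt": "Print the number 42."})
with_prompt = compose_prompt({"prompt": "Print the number 42."})
```

Here `with_task` still begins with the teaching prompt, while `with_prompt` contains only the task text, which is why a benchmark that uses 'prompt' can leave the model without any AILANG syntax to work from.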

What are valid capability names for AILANG benchmarks?

The supported capabilities you can define in your YAML include IO (Input/Output), FS (File System), Clock, and Net (Networking).
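As a rough sketch of what a capability check could look like, the snippet below compares a list of names against the four listed above. The helper name `unknown_capabilities` is hypothetical, and treating the names as case-sensitive is an assumption; the skill's own 'check_benchmark.sh' may behave differently.

```python
# Capability names taken from the list above; case-sensitivity is assumed.
VALID_CAPABILITIES = {"IO", "FS", "Clock", "Net"}


def unknown_capabilities(caps: list[str]) -> list[str]:
    """Return the names in caps that are not recognised capabilities."""
    return [c for c in caps if c not in VALID_CAPABILITIES]


ok = unknown_capabilities(["IO", "Net"])
bad = unknown_capabilities(["IO", "io", "Network"])
```

Under these assumptions, `ok` is empty and `bad` flags "io" and "Network" as names to correct in the YAML.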

Benchmark Manager for AILANG

Author: sunholo-data


Category: Data Science and Machine Learning

Creates, manages, and debugs AILANG evaluation benchmarks to ensure high-fidelity AI reasoning and syntax performance.

The Benchmark Manager skill provides a specialized framework for managing AILANG evaluation benchmarks, ensuring that AI models correctly learn and apply AILANG syntax during testing. It offers critical debugging workflows for common evaluation issues, distinguishes between task-specific and global prompts to prevent syntax failures, and includes automated validation scripts to verify benchmark YAML configurations. This tool is essential for developers working within the AILANG ecosystem who need to measure model performance across diverse capabilities such as I/O, file systems, and network operations while maintaining strict prompt integrity.

Key Features

01 Standardized AILANG benchmark creation and YAML validation

02 Automated scripts for testing benchmarks against specific Claude models

03 Advanced prompt debugging to visualize exactly what models see during evaluation

04 Comprehensive capability testing for I/O, File Systems, and Networking

05 Built-in checks for the critical 'task_prompt' vs 'prompt' configuration

06 20 GitHub stars

Use Cases

01 Debugging failing benchmarks where models revert to Python or pseudo-code

02 Creating new evaluation suites to test AILANG language feature adoption

03 Validating benchmark YAML files for schema accuracy and required capability flags


Install with 🐟 Skill.Fish

npx skillfish add sunholo-data/ailang benchmark-manager
