How does it map evaluations to specific training epochs?

It queries SLURM job names via sacct to parse epoch information (e.g., ep0, ep1) encoded in the job names, allowing for reliable mapping regardless of submission order.
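The job-name parsing described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the `ep<N>` token pattern is taken from the examples in the answer, while the function names, the regular expression, and the choice of `sacct` output columns are assumptions made for the example.

```python
import re
import subprocess

# Assumption: epochs appear in job names as "ep<N>" (e.g. "ep0", "ep1"),
# not embedded inside a longer word such as "step10".
EPOCH_RE = re.compile(r"(?<![A-Za-z0-9])ep(\d+)(?![A-Za-z0-9])")


def parse_epoch(job_name):
    """Return the epoch number encoded in a job name, or None if absent."""
    match = EPOCH_RE.search(job_name)
    return int(match.group(1)) if match else None


def map_jobs_to_epochs(sacct_text):
    """Map job IDs to epochs from pipe-delimited 'JobID|JobName' lines."""
    mapping = {}
    for line in sacct_text.strip().splitlines():
        job_id, _, job_name = line.partition("|")
        if "." in job_id:  # skip job steps such as "101.batch"
            continue
        epoch = parse_epoch(job_name)
        if epoch is not None:
            mapping[job_id] = epoch
    return mapping


def query_sacct(job_ids):
    """Fetch 'JobID|JobName' lines for the given job IDs (needs a SLURM cluster)."""
    cmd = [
        "sacct",
        "--jobs", ",".join(job_ids),
        "--format=JobID,JobName",
        "--parsable2",   # pipe-delimited output
        "--noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Because the epoch is read from the name rather than inferred from job ID order, resubmitting an evaluation out of sequence does not change the mapping.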

Can I run this before an experiment is fully finished?

Yes, the skill is designed to handle partial results. It will summarize completed runs and list any incomplete or failed runs in a dedicated 'Incomplete Runs' section.

Does this skill require a specific environment?

Yes. Activate the Conda environment that contains inspect-ai and the necessary project tools before running the skill; the log parsing scripts depend on them.

How does it handle binary classification tasks?

For binary tasks, it runs a specialized binary summary script that computes Balanced Accuracy and F1 scores, so reported performance stays meaningful even on imbalanced datasets where plain accuracy can mislead.
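To make the two metrics concrete, here is a small sketch that computes them from label lists using their standard definitions. It is not the skill's binary summary script; the function name `binary_metrics` and its return format are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, balanced accuracy, and F1 for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    total = tp + fp + tn + fn
    # Undefined ratios (zero denominators) are reported as 0.0 in this sketch.
    recall = tp / (tp + fn) if tp + fn else 0.0          # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
    }


# A model that always predicts the majority class on a 90/10 split.
labels = [0] * 90 + [1] * 10
always_negative = [0] * 100
print(binary_metrics(labels, always_negative))
```

In this example plain accuracy is 0.9, while balanced accuracy is 0.5 and F1 is 0.0, which exposes that the model never identifies the positive class.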

What files does this skill require to run?

The skill requires an experiment_summary.yaml file, SLURM output files for training runs, and .eval log files generated by inspect-ai to create a complete summary.
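A quick pre-flight check for these inputs might look like the sketch below. Only the `experiment_summary.yaml` name and the `.eval` extension come from the answer above; the directory layout, the helper names, and the `slurm-*.out` pattern (SLURM's default output file name) are assumptions, and a project may write its training output elsewhere.

```python
from pathlib import Path


def find_inputs(experiment_dir):
    """Locate the three kinds of input files under an experiment directory."""
    root = Path(experiment_dir)
    config = root / "experiment_summary.yaml"
    return {
        "config": config if config.is_file() else None,
        # Assumption: training output uses SLURM's default "slurm-<jobid>.out" name.
        "slurm_outputs": sorted(root.rglob("slurm-*.out")),
        "eval_logs": sorted(root.rglob("*.eval")),
    }


def missing_inputs(inputs):
    """Return the names of input kinds for which nothing was found."""
    return [name for name, found in inputs.items() if not found]
```

Calling `missing_inputs(find_inputs("path/to/experiment"))` returns an empty list when all three kinds of file are present.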

Summarize Experiment

Name: Summarize Experiment
Author: niznik-dev

Category: Data Science & Machine Learning

Generates comprehensive Markdown summaries of LLM fine-tuning and evaluation experiments by aggregating metrics from SLURM logs and evaluation files.

The Summarize Experiment skill automates post-run analysis for researchers working with Large Language Models. It parses experiment configurations, extracts training loss trajectories from SLURM stdout, and gathers accuracy metrics from evaluation logs. By mapping job names to specific epochs and calculating binary classification metrics such as F1 and Balanced Accuracy, it consolidates fragmented log files into a single, human-readable summary.md file. This gives researchers a quick view of model performance across hyperparameters and training stages.

Key Features

01 Comprehensive parsing of inspect-ai .eval logs for multi-task performance tracking.

02 Intelligent epoch-to-metric mapping using SLURM job name metadata.

03 Advanced binary classification metrics including Balanced Accuracy and F1 scores.

04 Automated extraction of training loss and step counts from SLURM output files.

05 Consolidated Markdown report generation with structured status, training, and evaluation tables.

06 11 GitHub stars

Use Cases

01 Reviewing the comparative results of a multi-model fine-tuning sweep.

02 Identifying the best-performing model epoch across various experimental conditions.

03 Diagnosing failed or incomplete runs within large-scale social research experiments.

Install with 🐟 Skill.Fish

npx skillfish add niznik-dev/cruijff_kit summarize-experiment

For use in Claude.ai and ChatGPT
