Does this skill provide definitive proof of what a feature detects?

No. The overview shows correlations, not causation. It is a starting point for generating hypotheses, and researchers should use experimental tools to reach final conclusions.

What is the main purpose of the MechInterp Overview skill?

It provides a fast, diagnostic summary of Sparse Autoencoder (SAE) features to help researchers form hypotheses based on correlations before conducting deep causal investigations.

What does a high sparsity percentage indicate?

Sparsity is the percentage of examples on which the feature's activation is exactly zero. High sparsity (95% or more) means the feature is highly selective and fires rarely.
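As a rough illustration of that definition, the sketch below computes a sparsity percentage from a plain list of activations. The function name and the sample data are assumptions made for this example and are not taken from the skill's own code.

```python
def sparsity_percent(activations):
    """Percentage of examples whose activation is exactly zero."""
    if not activations:
        raise ValueError("activations must be non-empty")
    zeros = sum(1 for a in activations if a == 0.0)
    return 100.0 * zeros / len(activations)


# Hypothetical data: 19 silent examples and one where the feature fires.
acts = [0.0] * 19 + [0.8]
print(sparsity_percent(acts))  # 95.0
```

A feature with this profile would sit right at the 95% threshold mentioned above.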

How does this skill help with 'flanderization' or super-stimuli?

It compares activations across regions of the activation distribution. If a token appears only in the top 10% of activations (super-stimuli) and is absent from the core 25-75% region, the skill flags that token as a potential tail marker rather than part of the feature's true concept.
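The region comparison can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes examples arrive as (activation, tokens) pairs, that percentiles are taken over firing (non-zero) examples only, and the token names are hypothetical.

```python
def tail_marker_tokens(examples):
    """Return tokens seen in the top 10% of activations but never in the 25-75% core.

    `examples` is a list of (activation, tokens) pairs; this layout is an
    assumption made for the sketch.
    """
    # Keep only firing examples, sorted from weakest to strongest activation.
    firing = sorted((e for e in examples if e[0] > 0), key=lambda e: e[0])
    n = len(firing)
    if n == 0:
        return set()
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    core = firing[n * 25 // 100 : n * 75 // 100]
    top = firing[n * 90 // 100 :]
    core_tokens = {t for _, toks in core for t in toks}
    top_tokens = {t for _, toks in top for t in toks}
    return top_tokens - core_tokens


# Hypothetical data: every example carries "swim_speed", but only the two
# strongest activations also carry "rare_combo".
examples = []
for i in range(1, 21):
    tokens = ["swim_speed"]
    if i > 18:
        tokens.append("rare_combo")
    examples.append((i / 20, tokens))

print(tail_marker_tokens(examples))  # {'rare_combo'}
```

Here "rare_combo" would be reported as a possible tail marker, while "swim_speed" appears in both regions and is not flagged.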

MechInterp Feature Overview

Name: MechInterp Feature Overview
Author: cesaregarza


Data Science & ML

Provides a rapid diagnostic summary of Sparse Autoencoder (SAE) features to generate research hypotheses and identify model behaviors.

This skill offers a comprehensive 'first look' at SAE features by calculating activation statistics, PageRank-weighted token influences, and ability family breakdowns. Designed for mechanistic interpretability workflows, it helps researchers distinguish between true core concepts and 'flanderized' super-stimuli by analyzing specific activation regions such as the Floor, Core, and High zones. It serves as a starting point for investigating how models represent domain-specific data, such as Splatoon NLP data, and provides the statistical foundation needed before moving to deeper causal experiments.

Key Features

01 Super-Stimuli Detection: Flags 'flanderized' features where the tail activations don't match the core concept.

02 PageRank Token Analysis: Identifies top 'enhancer' and 'suppressor' tokens weighted by their importance in high-activation contexts.

03 Domain-Specific Aggregation: Breaks down feature activations by ability families and weapon kits.

04 Activation Statistics: Computes mean, standard deviation, and sparsity percentages for feature activations.

05 ReLU Floor Diagnostics: Automatically warns if a feature is mostly zeros or difficult to interpret.

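To make the Domain-Specific Aggregation feature more concrete, here is a minimal sketch of a per-family breakdown. It assumes examples are (activation, tokens) pairs and that a token-to-family lookup is available; the function name, the mapping, and the token names are all hypothetical and do not come from the skill itself.

```python
from collections import defaultdict


def mean_activation_by_family(examples, family_of):
    """Average the feature activation over examples containing each family.

    `family_of` maps a token to its ability family (hypothetical mapping).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for activation, tokens in examples:
        # Count each family at most once per example.
        for family in {family_of[t] for t in tokens if t in family_of}:
            totals[family] += activation
            counts[family] += 1
    return {family: totals[family] / counts[family] for family in totals}


# Hypothetical tokens and families, for illustration only.
family_of = {
    "swim_speed_3": "swim_speed",
    "swim_speed_6": "swim_speed",
    "ink_saver_3": "ink_saver",
}
family_examples = [
    (0.5, ["swim_speed_3"]),
    (1.5, ["swim_speed_6", "ink_saver_3"]),
]
print(mean_activation_by_family(family_examples, family_of))
```

A family whose mean activation stands well above the others is a candidate explanation for the feature, which can then be tested with causal experiments.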

Use Cases

01 Screening features to determine if they represent broad patterns or niche, spurious correlations.

02 Initial exploration of newly trained SAEs to identify and label interpretable features.

03 Analyzing activation regions to distinguish between a feature's true concept and its extreme outliers.


Install with 🐟 Skill.Fish

npx skillfish add cesaregarza/splatnlp mechinterp-overview
