소개
SAELens provides a specialized framework for training and analyzing Sparse Autoencoders (SAEs), based on Anthropic’s groundbreaking research into monosemanticity. It allows researchers and engineers to solve the problem of polysemanticity—where single neurons represent multiple concepts—by extracting distinct, interpretable features from dense model activations. This skill enables deep visibility into what models have learned, supporting advanced tasks such as feature-based steering, safety analysis, and the discovery of specific semantic concepts within language models.