- Custom vocabulary training from large-scale text iterators (see the training sketch after this list)
- High-speed Rust core capable of tokenizing 1 GB of text in under 20 seconds
- Support for the BPE, WordPiece, and Unigram subword algorithms
- Complete pipeline control, from normalization to post-processing (see the pipeline sketch below)
- Advanced alignment tracking to map tokens to their original character offsets (see the offsets sketch below)
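
Training from an iterator is easiest to see in code. The sketch below assumes the Hugging Face `tokenizers` Python bindings, whose feature set matches this list; the corpus path, vocabulary size, and special tokens are illustrative, not prescribed by the library.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Any iterator of strings works, so the corpus never has to fit in memory.
# "corpus.txt" is a hypothetical file used here for illustration.
def corpus_iterator():
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```

Swapping `BPE`/`BpeTrainer` for `WordPiece`/`WordPieceTrainer` or `Unigram`/`UnigramTrainer` covers the other two algorithms listed above.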
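
Pipeline control means each stage is a swappable component. A sketch of configuring the full chain, again assuming the Hugging Face `tokenizers` bindings; the normalizer sequence, pre-tokenizer, and template (including the special-token IDs 1 and 2) are example choices:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization: Unicode decomposition, lowercasing, accent stripping.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Pre-tokenization: split on word boundaries and punctuation.
tokenizer.pre_tokenizer = Whitespace()

# Post-processing: wrap single sentences and pairs in [CLS]/[SEP].
# The IDs (1 and 2) assume those tokens hold these slots in the vocabulary.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```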
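
Finally, a sketch of the alignment tracking, continuing from the `tokenizer` trained in the first sketch above: every encoded token carries a `(start, end)` character span into the original input string (the sample sentence is illustrative).

```python
text = "Hello, y'all! How are you?"
output = tokenizer.encode(text)

print(output.tokens)   # the subword tokens produced for this input
print(output.offsets)  # one (start, end) character span per token

# Recover the exact source text behind any token:
start, end = output.offsets[0]
print(text[start:end])
```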