Extracts clean, plain text from LaTeX documents by stripping formatting commands and decoding special characters into Unicode.
The LaTeX Text Extractor (tex-strip) is a utility that converts LaTeX source files into readable plain text. It recursively removes formatting commands, font styles, and nested markup while preserving the underlying content. Unlike basic strippers, it also runs a decoding phase that converts LaTeX accents and ligatures into standard Unicode characters. This makes it useful for preparing academic papers or technical documentation for content analysis, LLM processing, or simplified reading.
Key Features
01 Normalization of whitespace and preservation of paragraph breaks
02 Automatic conversion of escaped characters like \& and \% to standard text
03 Recursive removal of nested LaTeX commands and formatting blocks
04 Unicode decoding for LaTeX accents and special ligatures
05 Support for file-to-file batch processing and command-line input
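
The stripping and decoding steps listed above can be illustrated with a minimal Python sketch. This is not tex-strip's actual implementation or API: the function name `strip_latex`, the `DROP` set, and the small accent and ligature tables are all hypothetical, chosen only to show how recursive command removal, Unicode decoding, and whitespace normalization might fit together.

```python
import re
import unicodedata

# Hypothetical, illustrative subsets; a real tool would cover many more cases.
ACCENTS = {"'": "\u0301", "`": "\u0300", "^": "\u0302", '"': "\u0308", "~": "\u0303"}
SPECIALS = {"ss": "\u00df", "ae": "\u00e6", "oe": "\u0153", "o": "\u00f8"}
DROP = {"label", "ref", "cite", "begin", "end", "usepackage", "documentclass"}

ACCENT_RE = re.compile(r"""\\(['`^"~])(?:\{([A-Za-z])\}|([A-Za-z]))""")
SPECIAL_RE = re.compile(r"\\(ss|ae|oe|o)(?![A-Za-z])(?:\{\})?")
COMMAND_RE = re.compile(r"\\([A-Za-z]+)\*?(?:\[[^\]]*\])?\{([^{}]*)\}")


def _accent(m):
    # Compose base letter + combining mark into a single code point (NFC).
    letter = m.group(2) or m.group(3)
    return unicodedata.normalize("NFC", letter + ACCENTS[m.group(1)])


def _command(m):
    # Keep the argument text unless the command carries no readable content.
    return "" if m.group(1) in DROP else m.group(2)


def strip_latex(text):
    text = re.sub(r"(?<!\\)%.*", "", text)            # remove comments
    text = ACCENT_RE.sub(_accent, text)               # \'e -> é, \"{u} -> ü
    text = SPECIAL_RE.sub(lambda m: SPECIALS[m.group(1)], text)
    while True:                                       # unwrap innermost commands first
        unwrapped = COMMAND_RE.sub(_command, text)
        if unwrapped == text:
            break
        text = unwrapped
    text = re.sub(r"\\[A-Za-z]+\*?", "", text)        # argument-less commands
    text = re.sub(r"\\([&%$#_])", r"\1", text)        # \& -> &, \% -> %
    text = re.sub(r"[{}]", "", text).replace("~", " ")
    # Collapse whitespace inside paragraphs; keep blank-line paragraph breaks.
    paragraphs = (" ".join(p.split()) for p in re.split(r"\n\s*\n", text))
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = r"""\section{Intro}

Caf\'e in \textbf{\emph{Z\"urich}} costs 5\% more. % internal note"""
    print(strip_latex(sample))
```

The loop unwraps one nesting level per pass because the pattern only matches commands whose argument contains no braces, so the innermost command is always handled first. Math mode, environments with their own syntax, and user-defined macros are deliberately out of scope for this sketch.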
Use Cases
01 Cleaning LaTeX source code to perform accurate word counts and readability analysis
02 Migrating content from .tex documents into web-based content management systems
03 Converting academic LaTeX papers into plain text for LLM summarization
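
To show why cleaning matters for the word-count use case, here is a small self-contained Python sketch that compares a naive count on raw LaTeX with a count on roughly cleaned text. The helpers `rough_plain_text` and `word_count` are hypothetical and use a deliberately crude regex cleanup; they stand in for whatever output the extractor would produce and do not reflect its real interface.

```python
import re


def rough_plain_text(tex):
    # Crude cleanup for illustration only (assumed, not the tool's logic).
    tex = re.sub(r"(?<!\\)%.*", "", tex)                       # comments
    tex = re.sub(r"\\[A-Za-z]+\*?(?:\[[^\]]*\])?", "", tex)    # command names, optional args
    tex = re.sub(r"\\([&%$#_])", r"\1", tex)                   # escaped characters
    return re.sub(r"[{}]", "", tex)                            # leftover braces


def word_count(tex):
    # Count runs of word characters in the cleaned text.
    return len(re.findall(r"\w+", rough_plain_text(tex)))


if __name__ == "__main__":
    raw = r"\section{Results} We \emph{clearly} gained 5\% accuracy. % TODO cite"
    print("naive count:", len(raw.split()))
    print("cleaned count:", word_count(raw))
```

The naive count is inflated by the comment and by stray markup tokens, which is the kind of noise a stripping pass is meant to remove before readability metrics are computed.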