01 Parallel processing of online and offline HTML input
02 Multiple output formats: TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI
03 4,118 GitHub stars
04 Web crawling and text discovery with sitemap and feed support
05 Configurable extraction of main text, metadata, and formatting
06 Optional language detection and speed optimizations
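
To make the first two items more concrete, here is a minimal sketch of what "extract the main text from several HTML inputs in parallel and emit it in a chosen format" can look like. It uses only the Python standard library and is not the project's API: the names `ParagraphExtractor`, `extract`, and `extract_many` are hypothetical, and treating `<p>` elements as the "main text" is a deliberately crude assumption standing in for real content detection.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: keeps the <title> and the text of <p> elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text outside <title>/<p> (navigation, footers, ...) is ignored.
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def extract(html, output_format="txt"):
    """Return the captured text as 'txt', 'markdown', or 'json'."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    if output_format == "json":
        return json.dumps({"title": parser.title, "text": "\n".join(parser.paragraphs)})
    if output_format == "markdown":
        body = "\n\n".join(parser.paragraphs)
        return f"# {parser.title}\n\n{body}" if parser.title else body
    return "\n".join(parser.paragraphs)


def extract_many(documents, output_format="txt", workers=4):
    """Process several HTML strings concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: extract(doc, output_format), documents))
```

Calling `extract_many([page_a, page_b], output_format="json")` would return one JSON string per input, in the same order. A thread pool is used here because it also suits the case where each input first has to be downloaded; for CPU-bound parsing of large offline batches, a process pool may be the better fit.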