Extracts text and metadata from web pages and online resources, offering various output formats.
Trafilatura is a Python package and command-line tool for web crawling, scraping, and text extraction. It turns raw HTML into structured data by gathering text, metadata, and comments from web pages. Features include sitemap support, parallel processing, and customizable extraction options, and results can be written to commonly used output formats. Trafilatura aims to balance precision and recall in what it extracts. It is actively maintained, integrated into projects by companies and institutions, and backed by comprehensive documentation and community support.
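
As a rough illustration of the extraction step described above, here is a minimal Python sketch. It assumes Trafilatura is installed (for example via pip) and relies on the library's `extract` function with its `output_format` and `include_comments` parameters. The sample page and the helper name `extract_main_text` are hypothetical and exist only for this example.

```python
from typing import Optional

# Hypothetical sample page, so the sketch needs no network access.
SAMPLE_HTML = """
<html>
  <head><title>Example article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Example article</h1>
      <p>This paragraph stands in for the main content of a web page.
      An extractor should keep it and drop the navigation and footer.</p>
      <p>A second paragraph gives the extractor a little more text to work with,
      since very short documents may be rejected as having no usable content.</p>
    </article>
    <footer>Copyright notice</footer>
  </body>
</html>
"""


def extract_main_text(html: str, output_format: str = "txt") -> Optional[str]:
    """Return the extracted content of an HTML string, or None if nothing is found.

    Hypothetical helper: a thin wrapper around trafilatura.extract.
    """
    # Imported inside the function so this sketch can be loaded
    # even in an environment where Trafilatura is not installed.
    import trafilatura

    return trafilatura.extract(
        html,
        output_format=output_format,  # e.g. "txt", "json", or "xml"
        include_comments=False,       # leave out user comment sections
    )


if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

For a live page, `trafilatura.fetch_url(url)` downloads the HTML (returning None on failure), and the result can be passed to the same helper. What comes back for a given page depends on the installed version and the extraction options, so treat this as a starting point, not a reference implementation.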