Trafilatura

Created by adbar

Extracts text and metadata from web pages and online resources, offering various output formats.

About

Trafilatura is a Python package and command-line tool for web crawling, scraping, and text extraction. It turns raw HTML into structured data by gathering main text, metadata, and comments from web pages. Features include sitemap support, parallel processing, and customizable extraction options, and results can be written to commonly used output formats. The extraction aims to balance precision and recall, that is, to keep out noise such as navigation and boilerplate while still capturing the relevant content. The project is actively maintained, has comprehensive documentation and community support, and has been integrated into various projects by companies and institutions.

Key Features

  • Parallel processing of online and offline HTML input
  • Multiple output formats: TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI
  • 4,118 GitHub stars
  • Web crawling and text discovery with sitemap and feed support
  • Configurable extraction of main text, metadata, and formatting
  • Optional language detection and speed optimizations
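
As a rough illustration of the configurable extraction and output formats listed above, here is a minimal Python sketch. It assumes the trafilatura package is installed (pip install trafilatura) and uses its extract() function on an HTML string. The sample page and the helper name extract_page are hypothetical and exist only for this example, and the options shown are a small subset chosen for illustration.

```python
from typing import Optional

try:
    import trafilatura  # third-party package: pip install trafilatura
except ImportError:
    trafilatura = None  # lets this sketch load even where the package is absent

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Sample article</h1>
      <p>This paragraph stands in for the main text of a web page. An
      extractor is expected to keep it and to drop the surrounding
      navigation and footer elements.</p>
      <p>A second paragraph makes the sample a little longer, because very
      short documents may be rejected by the extractor.</p>
    </article>
    <footer>Copyright notice</footer>
  </body>
</html>
"""


def extract_page(html: str, output_format: str = "txt") -> Optional[str]:
    """Extract the main content of an HTML string.

    Returns the extracted content as a string in the requested format,
    or None if the extractor finds nothing usable.
    """
    if trafilatura is None:
        raise RuntimeError("trafilatura is not installed")
    return trafilatura.extract(
        html,
        output_format=output_format,  # for example "txt", "json", or "xml"
        include_comments=False,       # skip reader comment sections
    )
```

Calling extract_page(SAMPLE_HTML) should return the article text without the navigation and footer, and passing output_format="json" should return a JSON string instead. The exact result depends on the installed version, and very short pages can yield None, so callers may want to check for that case.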

Use Cases

  • Creating news aggregators and content monitoring systems
  • Building text corpora for research
  • Extracting data for Natural Language Processing (NLP) and Machine Learning (ML) applications