Docs Scraper
Created by felores
Extracts clean, focused documentation from websites for human readers and LLM consumption.
About
This Python toolkit extracts clean, focused content from documentation websites. It offers four crawling strategies (single page, multi-page, sitemap-based, and menu-based) for gathering documentation. Output is formatted as clean Markdown and structured JSON, making it suitable for documentation sites, wikis, knowledge bases, LLM training, and RAG systems. Irrelevant elements such as navigation menus and ads are stripped, leaving a ready-to-use documentation source.
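To make the sitemap-based strategy concrete, here is a minimal sketch of how documentation URLs might be collected from a sitemap before crawling. It is not the toolkit's own code: the function name `urls_from_sitemap`, the `prefix` filter, and the sample URLs are hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return the <loc> URLs in a sitemap, keeping only those that start with prefix.

    Hypothetical helper for illustration; an empty prefix keeps every URL.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url.startswith(prefix):
            urls.append(url)
    return urls


# Inline sample so the sketch runs without network access (URLs are made up)
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/api</loc></url>
  <url><loc>https://example.com/blog/news</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE, prefix="https://example.com/docs/"):
        print(url)
```

Filtering by a path prefix is one simple way to keep a crawl focused on the documentation section of a site; a real crawler would also need to fetch the sitemap and handle sitemap index files, which this sketch leaves out.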
Key Features
- Handles dynamic content and lazy-loaded elements
- Provides colorful terminal feedback for status and errors
- Automatically identifies main content areas and removes irrelevant sections
- Offers multiple crawling strategies (single URL, multi-URL, sitemap, and menu-based)
- Outputs clean Markdown and structured JSON
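The content-cleaning and output features above can be illustrated with a small standard-library sketch. It is an assumption-laden illustration, not the toolkit's implementation: the class `MainContentParser`, the helpers `html_to_markdown` and `to_record`, and the choice of which tags count as irrelevant are all hypothetical, and inline formatting such as code spans is flattened to plain text.

```python
import json
from html.parser import HTMLParser

# Tags treated as page chrome and skipped entirely (an assumption for this sketch)
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADING_PREFIX) | {"p", "li"}


class MainContentParser(HTMLParser):
    """Collects headings, paragraphs, and list items found outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.current_tag = None  # block element currently being collected
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.current_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag == self.current_tag:
            # Collapse runs of whitespace into single spaces
            text = " ".join("".join(self.buffer).split())
            if text:
                prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")
                self.blocks.append(prefix + text)
            self.current_tag = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current_tag:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Convert the kept blocks to Markdown, one blank line between blocks."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


def to_record(url: str, html: str) -> str:
    """Wrap the Markdown in a small JSON record (field names are illustrative)."""
    return json.dumps({"url": url, "markdown": html_to_markdown(html)}, indent=2)


SAMPLE_HTML = """
<html><body>
<nav><ul><li>Home</li><li>Pricing</li></ul></nav>
<main>
  <h1>Install</h1>
  <p>Run the <code>setup</code> command.</p>
  <ul><li>Step one</li></ul>
</main>
<footer><p>Copyright</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(to_record("https://example.com/docs/install", SAMPLE_HTML))
```

Skipping whole elements by tag name is a deliberately crude heuristic. Real pages often mark navigation with classes or ARIA roles rather than semantic tags, and dynamic or lazy-loaded content requires a browser-based fetch step that this sketch does not attempt.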
Use Cases
- Preparing documentation for LLM training and RAG systems
- Creating documentation sites and wikis
- Building knowledge bases from dependency documentation