Kreuzberg is a document extraction skill for developers who need to process diverse file types, including PDFs, Office documents, images, and academic formats. It provides a unified interface across Python, Node.js, and Rust for extracting plain text as well as tables, metadata, and images. Built-in OCR support via Tesseract, EasyOCR, and PaddleOCR lets it handle scanned documents, which makes it well suited to RAG (Retrieval-Augmented Generation) pipelines, automated data entry, and large-scale content analysis workflows.
Key Features
01 Built-in text chunking and language detection for AI and LLM workflows
02 Advanced OCR capabilities for scanned documents and images
03 Supports 75+ file formats including PDF, Office, images, and emails
04 Structured data extraction including tables, metadata, and semantic elements
05 6,420 GitHub stars
06 High-performance Rust core with native bindings for Python and Node.js
Use Cases
01 Processing batch archives of legacy file formats into searchable markdown content
02 Automating data extraction from invoices, academic papers, and spreadsheets into structured JSON
03 Building RAG pipelines by converting diverse document sets into clean, chunked text for embeddings
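
To make the "chunked text for embeddings" step concrete, here is a minimal sketch of overlapping character-window chunking in plain Python. It is not Kreuzberg's own chunker and does not call the Kreuzberg API. The function name `chunk_text` and the parameters `max_chars` and `overlap` are hypothetical, chosen only for illustration, and the input is assumed to be text that has already been extracted from a document.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 5  # stand-in for extracted document text
    for i, chunk in enumerate(chunk_text(sample, max_chars=20, overlap=5)):
        print(i, chunk)
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap. Production pipelines often split on sentence or token boundaries instead of raw characters.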