Can I use Kreuzberg for RAG pipelines?

Absolutely. Kreuzberg is optimized for RAG workflows, offering built-in text chunking, metadata extraction, and markdown output formats to prepare data for vector embeddings.
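To make "text chunking" concrete, here is a minimal sketch of splitting extracted text into overlapping fixed-size windows before embedding. It is plain Python and does not call Kreuzberg. The function name `chunk_text` and the parameters `max_chars` and `overlap` are assumptions chosen for illustration, and Kreuzberg's built-in chunker may split differently (for example on sentence or markdown boundaries).

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window repeats the last `overlap` characters of the previous one,
    so a sentence cut at a boundary still appears whole in one chunk.
    Illustrative only: not Kreuzberg's implementation.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("abcdefghij", max_chars=4, overlap=1)):
        print(i, chunk)
```

The overlap trades some duplicated storage for better recall: a query that matches text near a chunk boundary can still retrieve a chunk containing the full passage.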

Does Kreuzberg support OCR for scanned images?

Yes, Kreuzberg includes built-in OCR support using Tesseract and offers optional backends like EasyOCR and PaddleOCR for specialized image-to-text requirements.

What programming languages are supported?

The core is written in Rust and can be used directly as a Rust library. Kreuzberg also provides native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, PHP, C#, and Elixir.

How do I handle password-protected PDFs?

Pass a list of candidate passwords through the PdfConfig or ExtractionConfig object in your language's binding. Kreuzberg uses them to unlock the PDF before extracting its content.
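The answer does not say how a list of candidate passwords is consumed. A reasonable reading is that each one is tried in turn until the document opens. The sketch below shows that pattern in plain Python without calling Kreuzberg: `first_working_password`, `try_unlock`, and `fake_unlock` are hypothetical names, and the stand-in unlock function only simulates a protected file.

```python
from typing import Callable, Iterable, Optional


def first_working_password(
    candidates: Iterable[str],
    try_unlock: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate that unlocks the document, or None.

    `try_unlock` is a hypothetical callable standing in for whatever
    actually opens the PDF; it returns True when the password is accepted.
    """
    for password in candidates:
        if try_unlock(password):
            return password
    return None


def fake_unlock(password: str) -> bool:
    """Simulated protected document whose password is 's3cret'."""
    return password == "s3cret"


if __name__ == "__main__":
    found = first_working_password(["guess1", "s3cret", "guess2"], fake_unlock)
    print("unlocked with:", found)
```

Ordering the list from most to least likely keeps the number of failed attempts low when processing many files.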

Which document formats does Kreuzberg support?

Kreuzberg supports over 75 formats, including PDF, Microsoft Office (Word, Excel, PowerPoint), images (PNG, JPG), eBooks (EPUB), emails (EML, MSG), and various academic or markup formats.

Kreuzberg Document Intelligence

Name: Kreuzberg Document Intelligence
Author: kreuzberg-dev

GitHub stars: 6,420
Category: Data Science & ML

Extracts structured text, metadata, and tables from over 75 document formats using a high-performance Rust core.

Kreuzberg is a document extraction skill for developers who need to process diverse file types, including PDFs, Office documents, images, and academic formats. It provides a unified interface across multiple languages (including Python, Node.js, and Rust) and extracts tables, metadata, and images in addition to plain text. With OCR support via Tesseract, EasyOCR, and PaddleOCR, it can also handle scanned documents, which makes it well suited to RAG (Retrieval-Augmented Generation) pipelines, automated data entry, and large-scale content analysis workflows.

Key Features

01 Built-in text chunking and language detection for AI and LLM workflows

02 Advanced OCR capabilities for scanned documents and images

03 Supports 75+ file formats including PDF, Office, images, and emails

04 Structured data extraction including tables, metadata, and semantic elements

05 6,420 GitHub stars

06 High-performance Rust core with native bindings for Python and Node.js

Use Cases

01 Processing batch archives of legacy file formats into searchable markdown content

02 Automating data extraction from invoices, academic papers, and spreadsheets into structured JSON

03 Building RAG pipelines by converting diverse document sets into clean, chunked text for embeddings

Install with 🐟 Skill.Fish

npx skillfish add kreuzberg-dev/kreuzberg kreuzberg
