Hybrid semantic and keyword search via Qdrant
Multi-format content extraction (PDF, DOCX, PPTX, XLSX, CSV, EPUB, XML, TXT, Markdown, HTML, RTF)
Automatic OCR fallback for scanned PDFs using vision models
Hash-based deduplication to prevent duplicate ingestion
Source integrity tracking to verify document references