01 Automated Named Entity Recognition (NER) for persons, organizations, locations, and financial data
02 Standardized DuckDB schema for document metadata and entity co-occurrence tracking
03 Multi-format text extraction via Apache Tika and batch OCR processing with Tesseract
04 Self-hosted document indexing and search integration via ICIJ Datashare
05 Pipeline coordination for forensic validation and graph-based investigation workflows
06 8 GitHub stars
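
To make the entity co-occurrence tracking item more concrete, here is a minimal sketch in plain Python of how document-level pair counts might be computed before being loaded into a table. It is not the project's actual schema or code: the `doc_entities` sample data and the `count_cooccurrences` helper are hypothetical, and an in-memory `Counter` stands in for the DuckDB storage the list mentions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: document id -> entity strings produced by an NER step.
doc_entities = {
    "doc-1": ["Acme Corp", "Jane Doe", "Panama"],
    "doc-2": ["Acme Corp", "Jane Doe"],
    "doc-3": ["Jane Doe", "Panama", "Jane Doe"],
}


def count_cooccurrences(docs):
    """Count how many documents mention each unordered entity pair."""
    counts = Counter()
    for entities in docs.values():
        # Deduplicate and sort so every pair has one canonical ordering
        # and repeated mentions inside a document count only once.
        unique = sorted(set(entities))
        counts.update(combinations(unique, 2))
    return counts


pair_counts = count_cooccurrences(doc_entities)
for (left, right), n_docs in pair_counts.most_common():
    print(f"{left} <-> {right}: {n_docs} document(s)")
```

Each `(entity_a, entity_b)` key with its document count maps naturally onto one row of a co-occurrence table, which is one plausible way such counts could feed a graph-based investigation workflow.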