01. Automated batch processing for massive PDF and document corpora
02. Advanced Named Entity Recognition (NER) for persons, organizations, and locations
03. 8 GitHub stars
04. Structured data modeling with DuckDB for relationship and co-occurrence analysis
05. Self-hosted search and annotation via ICIJ Datashare integration
06. Multi-format document extraction using Apache Tika and Tesseract OCR