Is the data stored locally or in the cloud?

The pipeline is designed for local or self-hosted deployment using Docker and DuckDB, ensuring sensitive investigative data remains under your control.

What document formats does this skill support?

It uses Apache Tika, which supports over a thousand file types, including PDF, DOCX, XLSX, and EML.
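A minimal sketch of Tika-based extraction, assuming the third-party `tika` Python client is installed and a Java runtime is available so the client can start or reach a Tika server. The function names `clean_text` and `extract_text` are illustrative and are not taken from the skill itself.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace that extraction tends to leave behind."""
    if not raw:
        return ""
    return re.sub(r"\s+", " ", raw).strip()


def extract_text(path):
    """Return (text, metadata) for one file via Apache Tika.

    Assumes the `tika` client package; it is imported lazily so that
    clean_text() stays usable without it.
    """
    from tika import parser  # pip install tika (needs Java for the server)

    parsed = parser.from_file(path)
    # "content" can be None when a file has no extractable text (e.g. a scan).
    return clean_text(parsed.get("content")), parsed.get("metadata") or {}
```

The same call works regardless of file type, because Tika detects the format itself. An empty result is a reasonable signal that the file may need OCR instead.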

How does it identify names and organizations?

The skill uses a spaCy NER pipeline to automatically detect and categorize persons, organizations, locations, dates, and financial figures.
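As a rough illustration of how spaCy output could be mapped onto those categories, the sketch below assumes the small English model `en_core_web_sm` and its standard labels (`PERSON`, `ORG`, `GPE`, `LOC`, `DATE`, `MONEY`). The model choice, the category names, and the helper functions are assumptions for illustration; the skill may configure its pipeline differently.

```python
from collections import defaultdict

# Assumed mapping from spaCy English model labels to the categories above.
# GPE (countries, cities) and LOC are both treated as locations here.
LABEL_TO_CATEGORY = {
    "PERSON": "person",
    "ORG": "organization",
    "GPE": "location",
    "LOC": "location",
    "DATE": "date",
    "MONEY": "financial_figure",
}


def group_entities(pairs):
    """Group (text, label) pairs by category, dropping unmapped labels."""
    grouped = defaultdict(set)
    for text, label in pairs:
        category = LABEL_TO_CATEGORY.get(label)
        if category:
            grouped[category].add(text.strip())
    # Sets remove repeated mentions; sorting makes the output stable.
    return {cat: sorted(names) for cat, names in grouped.items()}


def extract_entities(text, model="en_core_web_sm"):
    """Run spaCy NER over text and return entities grouped by category."""
    import spacy  # assumes spaCy and the named model are installed

    nlp = spacy.load(model)
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

For a large corpus, loading the model once and reusing the `nlp` object across documents would avoid repeated startup cost.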

Can it handle scanned documents or images?

Yes, it integrates Tesseract OCR to convert images and non-searchable PDF files into machine-readable text.
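A small sketch of the OCR step, assuming the `pytesseract` wrapper, Pillow, and the Tesseract binary are installed. The `needs_ocr` heuristic and its 50-character threshold are hypothetical, shown only to illustrate how a pipeline might decide when to fall back to OCR.

```python
def needs_ocr(extracted_text, min_chars=50):
    """Heuristic: treat a file as scanned if extraction returned little text."""
    return len((extracted_text or "").strip()) < min_chars


def ocr_image(path, lang="eng"):
    """Return the text Tesseract recognizes in a single image file."""
    from PIL import Image  # pip install pillow
    import pytesseract     # pip install pytesseract (needs tesseract binary)

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

A scanned PDF would first have to be rendered to page images before `ocr_image` can be applied; that step is not shown here.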

ICIJ Document Analysis

Name: ICIJ Document Analysis
Author: plurigrid

Data Science & ML

Automates large-scale document processing and entity extraction using investigative journalism methodologies.

This skill implements a document processing pipeline for large-scale leak analysis, modeled after investigative techniques used in the Panama, Paradise, and Pandora Papers. It combines ICIJ Datashare for search, Apache Tika for text extraction across file formats, Tesseract for OCR, and spaCy for Named Entity Recognition (NER). Coordinating with forensic validation and graph generation, it lets users transform unstructured document troves into structured DuckDB databases for link analysis and entity co-occurrence mapping in complex investigations.

Key Features

01 Automated batch processing for massive PDF and document corpora

02 Advanced Named Entity Recognition (NER) for persons, organizations, and locations

03 8 GitHub stars

04 Structured data modeling with DuckDB for relationship and co-occurrence analysis

05 Self-hosted search and annotation via ICIJ Datashare integration

06 Multi-format document extraction using Apache Tika and Tesseract OCR
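To make the co-occurrence analysis concrete, here is one possible sketch. It assumes a hypothetical `mentions` table with one row per entity mention (`doc_id`, `entity`, `label`); the skill's actual schema is not documented here. The functions rely only on `execute`, `executemany`, and `fetchall`, which a DuckDB connection provides.

```python
# Self-join: pair up entities that appear in the same document.
# a.entity < b.entity keeps each unordered pair once and skips self-pairs.
PAIR_QUERY = """
SELECT a.entity AS left_entity,
       b.entity AS right_entity,
       COUNT(DISTINCT a.doc_id) AS shared_docs
FROM mentions AS a
JOIN mentions AS b
  ON a.doc_id = b.doc_id AND a.entity < b.entity
GROUP BY a.entity, b.entity
ORDER BY shared_docs DESC, left_entity, right_entity
"""


def load_mentions(con, rows):
    """Create the (hypothetical) mentions table and insert mention rows."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS mentions "
        "(doc_id VARCHAR, entity VARCHAR, label VARCHAR)"
    )
    con.executemany("INSERT INTO mentions VALUES (?, ?, ?)", rows)


def cooccurring_pairs(con):
    """Return (left_entity, right_entity, shared_docs), most frequent first."""
    return con.execute(PAIR_QUERY).fetchall()
```

With the `duckdb` package installed, `con = duckdb.connect("investigation.duckdb")` opens or creates a local database file (the filename is illustrative), so the data stays on disk under your control. Each returned row can serve as a weighted edge in a relationship graph.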

Use Cases

01 Extracting and mapping entity relationships for graph-based investigative reporting

02 Analyzing massive leaked document sets to identify stakeholders and hidden networks

03 Building searchable databases from diverse file types like PDFs, emails, and spreadsheets

Install with 🐟 Skill.Fish

npx skillfish add plurigrid/asi icij-document-analysis
