Is the entity recognition customizable?

Yes. The skill uses the spaCy NER pipeline, which can be reconfigured or swapped for a different language model and set of entity labels to suit your dataset.

Can I export the analyzed data to other tools?

Yes, the skill provides recipes to export extracted entities and co-occurrences from DuckDB into CSV formats compatible with Neo4j and other graph analysis tools.
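As a rough illustration of what such an export can look like, the sketch below formats entity and co-occurrence rows as CSV using the header conventions of Neo4j's bulk importer (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The sample rows, the column layout, and the `CO_OCCURS_WITH` relationship type are assumptions made for this example; the skill's own recipes and DuckDB table names are not shown here and may differ.

```python
import csv
import io

# Hypothetical rows, standing in for results fetched from the DuckDB tables.
entities = [
    (1, "Acme Holdings Ltd", "ORG"),
    (2, "Jane Doe", "PERSON"),
]
cooccurrences = [
    (1, 2, 3),  # (entity_a id, entity_b id, number of shared documents)
]


def nodes_csv(rows):
    """Render entity rows as a Neo4j bulk-import node file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["entityId:ID", "name", ":LABEL"])
    writer.writerows(rows)
    return buf.getvalue()


def relationships_csv(rows, rel_type="CO_OCCURS_WITH"):
    """Render co-occurrence rows as a Neo4j bulk-import relationship file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", "weight:int", ":TYPE"])
    for start_id, end_id, weight in rows:
        writer.writerow([start_id, end_id, weight, rel_type])
    return buf.getvalue()


print(nodes_csv(entities))
print(relationships_csv(cooccurrences))
```

Writing the two strings to separate files gives one node file and one relationship file, which is the shape most graph tools expect.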

What document formats does this skill support?

Through its integration with Apache Tika, the skill supports more than a thousand file types, including PDF, DOCX, XLSX, EML, and various image formats.

How does it handle scanned images or non-searchable PDFs?

The skill includes a Tesseract OCR pipeline that automatically converts images and scanned documents into machine-readable text.

Does this require external cloud services to process documents?

No, the pipeline is designed to run locally or via self-hosted Docker containers (Tika, Tesseract, Datashare) to ensure data privacy and security.
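To make the local-only setup concrete, here is a minimal sketch of text extraction against a self-hosted Tika Server, using only the Python standard library. It assumes a Tika Server container is already listening on its default port (9998) on localhost; the helper names `build_tika_request` and `extract_text` are hypothetical and are not part of the skill.

```python
import urllib.request

# Assumption: a Tika Server container is listening on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text content."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str) -> str:
    """Send one local file to the local Tika Server and return its text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Because the request targets `localhost`, the document bytes never leave the machine (or the Docker network) that runs the container.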

ICIJ Document Analysis

Name: ICIJ Document Analysis
Author: plurigrid

Data Science and ML

Processes large-scale document leaks using investigative journalism methodologies to extract entities and build searchable databases.

This skill implements the specialized document processing pipeline pioneered by the International Consortium of Investigative Journalists (ICIJ) for landmark investigations like the Panama Papers and Pandora Papers. It coordinates a sophisticated stack including Apache Tika for content extraction, Tesseract for OCR, and spaCy for Named Entity Recognition (NER) to transform unstructured document troves into structured DuckDB databases. It is designed for researchers and developers who need to automate the analysis of massive, heterogeneous datasets while maintaining forensic integrity and preparing data for graph-based relationship mapping.
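The co-occurrence idea behind the relationship mapping can be sketched in a few lines: two entities co-occur when they appear in the same document, and the number of shared documents becomes the edge weight. The sample data and the `count_cooccurrences` helper below are assumptions for illustration only; the skill's actual schema and queries are not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: document id -> entities found in that document.
doc_entities = {
    "doc-001": ["Jane Doe", "Acme Holdings Ltd", "Panama"],
    "doc-002": ["Jane Doe", "Acme Holdings Ltd"],
    "doc-003": ["Panama", "Jane Doe"],
}


def count_cooccurrences(docs):
    """Count, for each unordered entity pair, the documents containing both."""
    counts = Counter()
    for names in docs.values():
        # Deduplicate and sort so each pair has a single canonical ordering.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


pairs = count_cooccurrences(doc_entities)
for (left, right), weight in sorted(pairs.items()):
    print(left, "|", right, "|", weight)
```

Each resulting pair and count maps naturally onto one row of a co-occurrence table, or one weighted edge in a graph database.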

Key Features

01 Automated Named Entity Recognition (NER) for persons, organizations, locations, and financial data

02 Standardized DuckDB schema for document metadata and entity co-occurrence tracking

03 Multi-format text extraction via Apache Tika and batch OCR processing with Tesseract

04 Self-hosted document indexing and search integration via ICIJ Datashare

05 Pipeline coordination for forensic validation and graph-based investigation workflows

06 8 GitHub stars

Use Cases

01 Transforming unstructured document collections into structured data for network analysis and graph databases

02 Analyzing large-scale leaked document sets for investigative reporting and research

03 Automating the extraction of corporate and personal entities from massive PDF and image archives

Install with 🐟 Skill.Fish

npx skillfish add plurigrid/asi icij-document-analysis
