What file formats does the ingest skill support?

The ingest skill supports PDF, DOCX, XLSX, PPTX, and Markdown files, and uses MinerU and MarkItDown for high-fidelity conversion.

Does it require an internet connection?

No. It can query external APIs for metadata enrichment, but the --no-api flag runs the pipeline in offline mode.

What is the 'full' preset used for?

The 'full' preset runs the complete pipeline: OCR, metadata extraction, deduplication, table of contents generation, and final indexing.
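As a rough illustration of what a preset amounts to, the sketch below models the 'full' preset as an ordered list of stage names applied to a document record. Everything here is hypothetical: the stage names, the `PRESETS` table, and `run_preset` are invented for this example and are not ScholarAIO's actual API, and the stage order simply follows the order in which the stages are listed above.

```python
from typing import Callable, Dict, List

# A stage takes a document record and returns an updated record.
Stage = Callable[[dict], dict]


def _mark(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran (hypothetical)."""
    def stage(doc: dict) -> dict:
        return {**doc, "done": doc.get("done", []) + [name]}
    return stage


# Hypothetical stage names mirroring the steps listed for the 'full' preset.
STAGES: Dict[str, Stage] = {
    name: _mark(name) for name in ("ocr", "metadata", "dedup", "toc", "index")
}

# Assumption: a preset is just an ordered selection of stages.
PRESETS: Dict[str, List[str]] = {
    "full": ["ocr", "metadata", "dedup", "toc", "index"],
}


def run_preset(preset: str, doc: dict) -> dict:
    """Apply each stage of the chosen preset in order."""
    for name in PRESETS[preset]:
        doc = STAGES[name](doc)
    return doc


result = run_preset("full", {"path": "paper.pdf"})
print(result["done"])
```

Under this model, a lighter preset would be a shorter list of the same stage names.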

How does it handle duplicate documents?

Deduplication depends on the document type: academic papers are deduplicated by DOI, and patents by their unique public numbers.
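The idea of type-specific deduplication keys can be sketched as follows. This is a minimal illustration, not the skill's actual code: the `Doc` record, `dedup_key`, and `dedupe` are hypothetical names, and the normalization rules (lowercasing DOIs and stripping resolver prefixes, removing separators from patent numbers) are assumptions about what a reasonable implementation might do.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Doc:
    """Hypothetical document record used only for this sketch."""
    title: str
    doi: Optional[str] = None
    patent_number: Optional[str] = None


def dedup_key(doc: Doc) -> Optional[Tuple[str, str]]:
    """Return a normalized key, or None if the document has no usable identifier."""
    if doc.doi:
        # DOIs are case-insensitive; also drop a resolver URL or "doi:" prefix.
        doi = doc.doi.strip().lower()
        doi = re.sub(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", "", doi)
        return ("doi", doi)
    if doc.patent_number:
        # Assumption: spaces, hyphens, slashes, and commas are formatting only.
        number = re.sub(r"[\s\-/,]", "", doc.patent_number).upper()
        return ("patent", number)
    return None


def dedupe(docs: List[Doc]) -> List[Doc]:
    """Keep the first document seen for each key; keep all documents without a key."""
    seen = set()
    kept = []
    for doc in docs:
        key = dedup_key(doc)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append(doc)
    return kept


docs = [
    Doc("Paper A", doi="10.1000/XYZ123"),
    Doc("Paper A (copy)", doi="https://doi.org/10.1000/xyz123"),
    Doc("Patent B", patent_number="US 1,234,567 B2"),
    Doc("Patent B (copy)", patent_number="US1234567B2"),
    Doc("Report without identifier"),
]
print([d.title for d in dedupe(docs)])
```

Documents without a DOI or patent number are kept rather than dropped, since there is no reliable key to compare them on.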

Can I process massive conference proceedings?

Yes. It has a semi-automatic, two-stage pipeline built for proceedings, which supports structural review and intelligent splitting into individual papers.

ScholarAIO Document Ingestion

Name: ScholarAIO Document Ingestion
Author: ZimoLiao

Data Science and ML

Processes academic papers, patents, and technical documents from various formats into a structured, searchable research knowledge base.

The ingest skill for ScholarAIO automates the pipeline that turns raw research materials (PDFs, Office documents, and patents) into AI-ready Markdown. It handles advanced OCR via MinerU, automated metadata extraction, and deduplication using DOIs or patent public numbers. Whether you are managing a single research paper or a multi-volume conference proceeding, the skill streamlines the transition from a messy inbox to a fully enriched research terminal, with specific workflows for academic theses, technical reports, and high-volume document sets.

Key Features

01 Customizable processing presets for ingestion, re-indexing, or full content enrichment

02 Specialized pipelines for academic papers, patents, and conference proceedings

03 Automated conversion of PDF, DOCX, XLSX, and PPTX files

04 Intelligent document segmenting and structural cleaning for complex proceedings

05 265 GitHub stars

06 Automated metadata extraction and deduplication using DOI and patent public numbers

Use Cases

01 Converting technical Office documents into structured Markdown for RAG or AI analysis

02 Building a local searchable library from a folder of academic PDFs and conference papers

03 Standardizing a patent repository by extracting metadata and removing duplicates automatically
