PDF Extractor
Leverages Apache Tika to extract content and metadata from various local file formats (PDF, DOCX, TXT) into HTML or plain text.
About
The Tika Extractor is a Model Context Protocol (MCP) server that uses Apache Tika to extract content and metadata from files stored in a local directory, including PDF, DOCX, and TXT. It converts content either to HTML, optionally styled with CSS for readability, or to plain text, and it can also list the available files and retrieve detailed metadata. Built with Java, Spring Boot, and Jetty, it works with MCP-compliant clients and provides REST endpoints for testing and for serving rendered HTML directly, which makes it suitable for secure, offline document processing workflows.
Key Features
- File Extraction: Converts content to HTML (with CSS) or plain text using Apache Tika.
- File Listing: Scans a designated directory to list available files with details like size and MIME type.
- REST Testing Endpoints: Provides API endpoints for easy testing, including direct raw HTML serving for browser rendering.
- Metadata Extraction: Retrieves key metadata such as title, author, content type, and creation date.
- MCP Integration: Exposes four synchronous tools for content extraction, text extraction, file listing, and metadata retrieval.
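
The file listing feature above can be illustrated with a minimal Java sketch. It is not the project's code: the class and field names (FileListingSketch, FileInfo) are hypothetical, and MIME detection uses the JDK's Files.probeContentType as a stand-in for Apache Tika's detector, so results may vary by platform.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of directory listing with size and MIME type.
final class FileListingSketch {

    // Illustrative holder for one listed file (name invented for this example).
    static final class FileInfo {
        final String name;
        final long sizeBytes;
        final String mimeType;

        FileInfo(String name, long sizeBytes, String mimeType) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.mimeType = mimeType;
        }
    }

    // Lists regular files directly inside dir (no recursion), sorted by path.
    static List<FileInfo> listFiles(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        List<FileInfo> result = new ArrayList<>();
        for (Path file : files) {
            // probeContentType may return null when the type is unknown;
            // fall back to a generic binary type in that case.
            String mime = Files.probeContentType(file);
            result.add(new FileInfo(
                    file.getFileName().toString(),
                    Files.size(file),
                    mime != null ? mime : "application/octet-stream"));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Directory defaults to the current one; the real server's
        // configured directory is not known from this description.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (FileInfo info : listFiles(dir)) {
            System.out.println(info.name + "\t" + info.sizeBytes + " bytes\t" + info.mimeType);
        }
    }
}
```

Restricting the scan to regular files in a single directory keeps the listing predictable; a real server would likely also need to guard against paths that escape the configured directory.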
Use Cases
- Enable secure, offline document processing workflows by extracting content and metadata locally.
- Integrate with MCP-compliant clients like Claude Desktop or MCP Inspector for automated document analysis.
- Facilitate testing of content extraction and metadata retrieval capabilities via dedicated REST APIs.