Generates photorealistic smartphone photos of logistics documents with induced defects and verified ground truth.
Penquify is a Python toolkit for creating high-fidelity synthetic datasets to train and test Document AI and OCR models. It works as "OCR in reverse": instead of recovering text from a degraded image, it starts from known data and deliberately degrades the document, simulating real-world imperfections such as coffee stains, folds, blur, and skew. From simple inputs such as JSON or existing PDFs, it generates clean, realistic logistics documents (e.g., dispatch guides) and then transforms them into photorealistic smartphone images. Each image comes with verified ground truth for every data field and an occlusion manifest describing what the induced defects cover. Developers and data scientists can therefore build large, varied training and evaluation datasets without collecting real-world documents.
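
To make the idea of ground truth plus an occlusion manifest concrete, here is a minimal sketch in plain Python. It is not Penquify's API: the names `Box` and `build_manifest`, the manifest keys, and the 0.5 readability threshold are all assumptions chosen for illustration. It takes known field values with their bounding boxes, overlays rectangular defects, and records how much of each field is covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Clamp to zero so inverted (non-overlapping) boxes have no area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> "Box":
        return Box(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


def build_manifest(fields, defects, threshold=0.5):
    """Pair each field's true value with its occlusion status.

    fields:  {field_name: (true_value, Box)}
    defects: {defect_name: Box}  (stains approximated as rectangles)
    threshold: assumed cutoff; a field covered by this fraction or more
               is marked unreadable.
    """
    manifest = {}
    for name, (value, box) in fields.items():
        worst = 0.0
        occluded_by = []
        for defect_name, defect_box in defects.items():
            fraction = box.intersection(defect_box).area() / box.area()
            if fraction > 0:
                occluded_by.append(defect_name)
            worst = max(worst, fraction)
        manifest[name] = {
            "value": value,
            "occluded_fraction": round(worst, 3),
            "readable": worst < threshold,
            "occluded_by": occluded_by,
        }
    return manifest


# Illustrative data only; field names and values are made up.
fields = {
    "guide_number": ("GD-001", Box(10, 10, 110, 30)),
    "carrier": ("ACME", Box(10, 50, 110, 70)),
}
defects = {"coffee_stain": Box(40, 0, 200, 40)}
manifest = build_manifest(fields, defects)
```

In this sketch the stain covers most of `guide_number`, so that field is flagged as unreadable while `carrier` is untouched. A manifest like this lets an evaluation distinguish a model that misreads a legible field from one that fails on a field no reader could recover.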
