PDFSyntax
A Python library to inspect and transform the internal structure of PDF files.
Overview
PDFSyntax is a lightweight, pure Python library designed to handle the internal structure of PDF files (as defined in Chapter 7 "Syntax" of the PDF Specification). It allows developers to inspect, navigate, and transform PDF documents down to the byte level.
Key Features
- Zero Dependencies: Written from scratch in pure Python.
- Immutability: Favors non-destructive edits. By default, it uses incremental updates, appending changes to the end of the file rather than rewriting it. Because earlier bytes are never overwritten, every previous revision stays in the file, which makes version history and undo possible.
- Dual Interface:
  - API: A toolkit for Python developers to read, analyze, and modify PDFs.
  - CLI: Command-line tools to browse, disassemble, and extract text from PDFs.
- Browser Visualization: Generate static HTML files to visually inspect the internal object structure of a PDF in a web browser.
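To make the incremental-update idea from the Immutability feature concrete, here is a minimal sketch in plain Python. It does not use the PDFSyntax API. The function names (`build_minimal_pdf`, `append_revision`, `undo_last_revision`) are hypothetical and exist only for this illustration. It assumes a file with a classic cross-reference table (not a cross-reference stream) and an end-of-line after the final `%%EOF`.

```python
import re


def build_minimal_pdf() -> bytes:
    """Build a tiny two-object PDF with a classic xref table (illustration only)."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def append_revision(original: bytes, obj_num: int, obj_body: bytes) -> bytes:
    """Redefine one object by appending a new revision; `original` is kept as a prefix."""
    # Read the previous revision's xref offset, /Size and /Root with simple regexes.
    # This is a shortcut for the sketch, not a real PDF parser.
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", original)[-1])
    size = int(re.findall(rb"/Size\s+(\d+)", original)[-1])
    root = re.findall(rb"/Root\s+(\d+\s+\d+\s+R)", original)[-1]

    out = bytearray(original)
    if not out.endswith(b"\n"):
        out += b"\n"
    obj_offset = len(out)
    out += b"%d 0 obj\n" % obj_num + obj_body + b"\nendobj\n"

    # The new xref section lists only the changed object; /Prev links to the old section.
    xref_offset = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_offset)
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (
        max(size, obj_num + 1), root, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def undo_last_revision(pdf: bytes) -> bytes:
    """Drop the newest revision by cutting the file after the previous %%EOF marker."""
    ends = [m.end() for m in re.finditer(rb"%%EOF\r?\n?", pdf)]
    if len(ends) < 2:
        raise ValueError("no earlier revision to return to")
    return pdf[:ends[-2]]
```

Calling `append_revision(build_minimal_pdf(), 2, b"<< /Type /Pages /Kids [] /Count 0 /Rotate 90 >>")` returns a byte string that starts with the unmodified original, and `undo_last_revision` on that result gives the original bytes back. The design point is that an edit only ever adds bytes, so rolling back is a truncation.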
Project Status
Beta Quality: This software is a work in progress. The API may change.
Current capabilities include inspection, rotation, metadata access, and basic text extraction. Planned features include page cutting/appending, lossless compression, and advanced layout detection.
License
This project is licensed under the MIT License.