PDFSyntax

A Python library to inspect and transform the internal structure of PDF files.

Overview

PDFSyntax is a lightweight, pure-Python library for working with the internal structure of PDF files, as defined in Chapter 7, "Syntax", of the PDF Specification. It lets developers inspect, navigate, and transform PDF documents down to the byte level.
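
To give a sense of what "byte level" means here, the following is a minimal sketch that reads two structural markers from raw PDF bytes: the version in the header and the offset recorded after the last `startxref` keyword. It uses only the standard library and does not call PDFSyntax; the helper names `pdf_version` and `last_startxref` and the tiny `SAMPLE` document are assumptions made up for illustration.

```python
import re

# A tiny hand-written PDF skeleton used only for illustration (hypothetical
# sample, not taken from PDFSyntax). The cross-reference table starts at
# byte offset 45, which is the number recorded after "startxref".
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)


def pdf_version(data: bytes) -> str:
    """Return the version declared in the %PDF-x.y header."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF header")
    return match.group(1).decode("ascii")


def last_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not offsets:
        raise ValueError("no startxref section found")
    return int(offsets[-1])


version = pdf_version(SAMPLE)
xref_offset = last_startxref(SAMPLE)
# The recorded offset should point at the "xref" keyword itself.
keyword_at_offset = SAMPLE[xref_offset:xref_offset + 4]
```

A real parser has to handle many more cases (cross-reference streams, comments, unusual whitespace), so this is only a rough picture of the kind of structure the library deals with.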

Key Features

  • Zero Dependencies: Written from scratch in pure Python.
  • Immutability: Favors non-destructive edits. By default, it uses incremental updates, appending changes to the end of the file rather than rewriting it. Because earlier revisions stay intact in the file, this makes version history and undo possible.
  • Dual Interface:
    • API: A toolkit for Python developers to read, analyze, and modify PDFs.
    • CLI: Command-line tools to browse, disassemble, and extract text from PDFs.
  • Browser Visualization: Generate static HTML files to visually inspect the internal object structure of a PDF in a web browser.
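
The incremental-update idea behind the Immutability feature can be sketched in a few lines of plain Python. This is an illustration of the general PDF mechanism, not PDFSyntax's implementation or API: the function names and the abbreviated byte strings are assumptions. It also assumes that each revision ends with a `%%EOF` marker and that the marker does not occur inside stream data, which a real tool would need to check.

```python
EOF_MARKER = b"%%EOF"


def count_revisions(data: bytes) -> int:
    """Count revisions, assuming one %%EOF marker per revision."""
    return data.count(EOF_MARKER)


def append_revision(data: bytes, update: bytes) -> bytes:
    """Add a revision by appending bytes; the original bytes are untouched."""
    return data + update


def undo_last_revision(data: bytes) -> bytes:
    """Drop the last revision by truncating after the previous %%EOF."""
    last = data.rfind(EOF_MARKER)
    if last == -1:
        raise ValueError("no %%EOF marker found")
    previous = data.rfind(EOF_MARKER, 0, last)
    if previous == -1:
        raise ValueError("only one revision, nothing to undo")
    end = previous + len(EOF_MARKER)
    # Keep the end-of-line bytes that followed the previous marker, if any.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\n", b"\r"):
        end += 1
    return data[:end]


# Abbreviated placeholder bytes: these are NOT valid PDF files, they only
# mimic the overall shape (body, trailer, %%EOF) of each revision.
original = b"%PDF-1.7\n(original objects)\ntrailer\nstartxref\n0\n%%EOF\n"
update = b"(changed objects)\ntrailer\nstartxref\n0\n%%EOF\n"

updated = append_revision(original, update)
restored = undo_last_revision(updated)
```

Here `updated` still begins with every byte of `original`, which is what makes earlier versions recoverable: undoing the change amounts to cutting the file back to the end of the previous revision.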

Project Status

Beta Quality: This software is a work in progress. The API may change.

Current capabilities include inspection, rotation, metadata access, and basic text extraction. Planned features include page cutting/appending, lossless compression, and advanced layout detection.

License

This project is licensed under the MIT License.