Architecture & Design

PDFSyntax is built on a small set of architectural choices (immutable documents, pure functions, and append-only revisions) intended to make PDF manipulation safer and more transparent.

Immutability and Pure Functions

The library adopts a functional style. The core Doc object is treated as immutable. Functions that modify the document (like rotate or update_object) do not change the existing object in place. Instead, they return a new Doc object containing the requested changes.

# doc is unchanged; doc_new contains the rotation
doc_new = rotate(doc, 90)
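
The pattern itself can be illustrated without the library. The following is a minimal sketch, assuming a hypothetical FrozenDoc class and toy_rotate function that are not part of PDFSyntax; it only shows how a "modifying" function can return a new value and leave its input alone.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FrozenDoc:
    """Hypothetical stand-in for Doc; not PDFSyntax's actual class."""
    rotation: int = 0


def toy_rotate(doc: FrozenDoc, degrees: int) -> FrozenDoc:
    """Return a new FrozenDoc with the rotation applied; the input is untouched."""
    return replace(doc, rotation=(doc.rotation + degrees) % 360)


doc = FrozenDoc()
doc_new = toy_rotate(doc, 90)

print(doc.rotation)      # rotation of the original value
print(doc_new.rotation)  # rotation of the returned value
```

Because the dataclass is frozen, assigning to doc.rotation raises an error, so the only way to obtain a changed document is to go through a function that returns a new one.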

The Doc Object

The Doc object acts as a container for:

  1. Index: A mapping of object numbers to their locations (byte offsets) in the file.
  2. Cache: Memoized parsed objects to avoid re-parsing the file repeatedly.
  3. Data: Handles to the raw binary file data.
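
How the three parts cooperate can be sketched as follows. This is a toy model under stated assumptions: ContainerDoc, get_object, and the field names are hypothetical and do not reflect PDFSyntax's real internals. The index gives the byte offset, the data supplies the raw bytes, and the cache avoids slicing the same object twice.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ContainerDoc:
    """Hypothetical container; field names are illustrative only."""
    data: bytes                     # raw file bytes
    index: Dict[int, int]           # object number -> byte offset in data
    cache: Dict[int, bytes] = field(default_factory=dict)  # memoized objects


def get_object(doc: ContainerDoc, num: int) -> bytes:
    """Return the raw bytes of object `num`, reading from data only on a cache miss."""
    if num not in doc.cache:
        start = doc.index[num]
        # In this toy format an object ends at the first "endobj" after its offset.
        end = doc.data.index(b"endobj", start) + len(b"endobj")
        doc.cache[num] = doc.data[start:end]
    return doc.cache[num]


raw = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
doc = ContainerDoc(data=raw, index={1: 0, 2: raw.index(b"2 0 obj")})

first = get_object(doc, 2)   # located via the index, then cached
second = get_object(doc, 2)  # served from the cache
```

In this sketch the cache is an ordinary dict that fills lazily; it stores derived data only, so the document content seen by callers does not change.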

Handling Revisions

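PDFSyntax natively understands incremental updates, the PDF mechanism for saving changes by appending new data to the end of a file instead of rewriting it.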

  • Revisions: When you load a PDF, it may already contain multiple revisions (original save + subsequent edits).
  • Append-Only: Modifying a document with PDFSyntax produces a new revision in memory. When you call writefile, those changes are appended to the end of the original byte stream.
  • Rewind: You can programmatically roll back changes using rewind(doc) to access previous states of the document.
  • Squash: You can merge all revisions into a single, clean file using squash(doc) when the accumulated history makes the file too large.
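
The revision behavior described above can be modeled in a few lines. This is a simplified sketch, not PDFSyntax's implementation: RevisionDoc, toy_update, toy_lookup, toy_rewind, and toy_squash are hypothetical names, and each revision is reduced to a dict of object number to value.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RevisionDoc:
    """Hypothetical model: an ordered tuple of revisions, oldest first."""
    revisions: Tuple[Dict[int, str], ...]


def toy_update(doc: RevisionDoc, num: int, value: str) -> RevisionDoc:
    """Append a new revision that holds only the changed object."""
    return RevisionDoc(doc.revisions + ({num: value},))


def toy_lookup(doc: RevisionDoc, num: int) -> str:
    """Resolve an object by searching from the newest revision backwards."""
    for revision in reversed(doc.revisions):
        if num in revision:
            return revision[num]
    raise KeyError(num)


def toy_rewind(doc: RevisionDoc) -> RevisionDoc:
    """Return the document as it was before its most recent revision."""
    if len(doc.revisions) <= 1:
        raise ValueError("no earlier revision to rewind to")
    return RevisionDoc(doc.revisions[:-1])


def toy_squash(doc: RevisionDoc) -> RevisionDoc:
    """Merge all revisions into one, keeping the newest value of each object."""
    merged: Dict[int, str] = {}
    for revision in doc.revisions:
        merged.update(revision)
    return RevisionDoc((merged,))


original = RevisionDoc(({1: "Catalog", 2: "Page rotated 0"},))
edited = toy_update(original, 2, "Page rotated 90")
squashed = toy_squash(edited)
```

Here toy_rewind(edited) compares equal to original, while squashed holds a single revision with the latest value of every object. In this model the history is never overwritten, only extended or collapsed.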