PDF Basics

Understanding how PDFSyntax works requires a brief introduction to the PDF file structure, as defined in ISO 32000.

File Structure Layers

A PDF file consists of four main domains:

  1. Objects: The basic building blocks (Booleans, Numbers (integer and real), Strings, Names, Arrays, Dictionaries, Streams, and the null object).
  2. File Structure: How objects are stored (Header, Body, Cross-Reference Table, Trailer).
  3. Document Structure: The logical hierarchy (Catalog -> Pages -> Page).
  4. Content Streams: Instructions for painting the page (drawing text, shapes, images).

Key Components

The Header

The first line of the file, e.g., %PDF-1.4, which states the PDF version the file conforms to.
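As an illustration, the version can be read from the header with a few lines of Python. This is a minimal sketch and not part of the PDFSyntax API; the function name parse_pdf_version and the sample bytes are assumptions made for the example.

```python
import re

def parse_pdf_version(data: bytes) -> str:
    """Return the version from a header such as b'%PDF-1.4' (hypothetical helper)."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("data does not start with a PDF header")
    return match.group(1).decode("ascii")

# Sample header followed by the customary binary comment line.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
version = parse_pdf_version(sample)
print(version)
```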

Indirect Objects

Objects that can be referenced by other objects. They are wrapped in obj ... endobj blocks and identified by an Object Number and Generation Number (e.g., 1 0 obj).
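The sketch below shows what such blocks look like in raw bytes and how the object and generation numbers can be picked out. It is a deliberately naive illustration: the regular expression and the helper find_indirect_objects are assumptions made for this example, and a stream whose data happens to contain the word endobj would defeat it, which is why real parsers rely on the cross-reference data instead.

```python
import re

# Naive pattern: "<object number> <generation number> obj ... endobj".
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in INDIRECT_OBJECT.finditer(data)
    }

sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
objects = find_indirect_objects(sample)
for key, body in objects.items():
    print(key, body)
```

Note how object 1 refers to object 2 with the indirect reference 2 0 R, which uses the same pair of numbers.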

The Cross-Reference (XRef)

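An index table near the end of the file, just before the trailer, listing the byte offset of every indirect object. This allows a reader to jump straight to any object without parsing the whole document. Since PDF 1.5 the same information may also be stored in a cross-reference stream instead of a plain-text table.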
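To make the layout concrete, here is a minimal sketch that parses a classic plain-text cross-reference section. Each entry holds a 10-digit byte offset, a 5-digit generation number, and a flag (n for in use, f for free). The function parse_xref_section and the sample offsets are assumptions made for this example; cross-reference streams are not handled.

```python
def parse_xref_section(text: str) -> dict:
    """Map (object number, generation) to byte offset for in-use entries."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("expected the xref keyword")
    table = {}
    i = 1
    while i < len(lines) and lines[i].strip() != "trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, flag = lines[i].split()
            i += 1
            if flag == "n":  # skip free entries
                table[(obj_num, int(generation))] = int(offset)
    return table

sample = (
    "xref\n"
    "0 3\n"
    "0000000000 65535 f \n"
    "0000000009 00000 n \n"
    "0000000058 00000 n \n"
)
table = parse_xref_section(sample)
print(table)
```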

The Trailer

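A dictionary located near the end of the file, introduced by the trailer keyword. It is followed by the startxref keyword, a byte offset, and the %%EOF marker that closes the file. Together they tell the reader:

  • Where the Root (Catalog) object is (the /Root entry).
  • Where the XRef table starts (the offset after startxref).
  • Security information (the /Encrypt entry, present only in encrypted files).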
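A reader typically starts from the end of the file to find this information. The following sketch, which assumes a classic trailer rather than a cross-reference stream, extracts the three items listed above. The helper read_tail and the sample bytes are hypothetical and only illustrate the idea.

```python
import re

def read_tail(data: bytes) -> dict:
    """Extract the XRef offset, Root reference, and encryption flag (sketch)."""
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if tail is None:
        raise ValueError("no startxref / %%EOF found at end of data")
    start = data.rfind(b"trailer", 0, tail.start())
    if start == -1:
        raise ValueError("no trailer keyword found")
    trailer = data[start:tail.start()]
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    return {
        "xref_offset": int(tail.group(1)),
        "root": (int(root.group(1)), int(root.group(2))) if root else None,
        "encrypted": b"/Encrypt" in trailer,
    }

sample = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n116\n%%EOF\n"
info = read_tail(sample)
print(info)
```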

Incremental Updates

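A PDF does not have to be rewritten from scratch when it changes. When you save a change in a PDF editor, the software often simply appends the new or modified objects, a new XRef section, and a new trailer to the end of the file. The original bytes stay untouched, so an earlier revision can be recovered by truncating the file at its previous %%EOF marker, which is what makes "Undo" operations possible. PDFSyntax leverages this mechanism heavily.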
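The mechanism can be sketched in plain Python by building a tiny file and then appending a second revision to it. This is an illustration only, not PDFSyntax's implementation: the function append_revision and its parameters are assumptions made for the example, and the output is a structural sketch that has not been validated against a PDF reader.

```python
from typing import Dict, Optional, Tuple

def append_revision(previous: bytes, prev_xref: Optional[int],
                    objects: Dict[int, bytes], size: int,
                    root: int = 1) -> Tuple[bytes, int]:
    """Append objects, an xref section, and a trailer after `previous`."""
    out = bytearray(previous)
    if prev_xref is None:
        out += b"%PDF-1.4\n"  # first revision: write the header
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)  # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, objects[num])
    xref_offset = len(out)
    out += b"xref\n"
    if prev_xref is None:
        out += b"0 1\n0000000000 65535 f \n"  # head of the free list
    for num in sorted(offsets):
        out += b"%d 1\n%010d 00000 n \n" % (num, offsets[num])
    trailer = b"<< /Size %d /Root %d 0 R" % (size, root)
    if prev_xref is not None:
        trailer += b" /Prev %d" % prev_xref  # link to the older xref section
    out += b"trailer\n" + trailer + b" >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out), xref_offset

original, xref1 = append_revision(b"", None, {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [] /Count 0 >>",
}, size=3)

# Second revision: only the changed object is written again.
updated, xref2 = append_revision(original, xref1, {
    1: b"<< /Type /Catalog /Pages 2 0 R /Lang (en) >>",
}, size=3)

print(updated.startswith(original))
```

The updated file begins with the original bytes, so cutting it back to len(original) restores the first revision. The /Prev entry in the new trailer is what lets a reader fall back to the older XRef section for objects that were not changed.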