How PDF Works — Inside the Portable Document Format

What PDF Was Designed To Be

PDF stands for Portable Document Format. Adobe created it in 1993 with one goal: a document that looks identical on every device, regardless of operating system, fonts, or screen size. Unlike HTML, which reflows, or a Word document, which renders differently based on installed fonts, a PDF is a precise description of a page. Place a word at a specific point and that's exactly where it appears on every viewer. (See Wikipedia's PDF overview for a broader history.)

Technically, PDF is a page-description format and a direct descendant of PostScript, Adobe's earlier printer language. Where PostScript is a full programming language that a printer executes to draw a page, PDF keeps the same imaging model but drops the programming constructs: it is a fixed description of what's on each page, encoded into a structured binary file. No loops, no conditionals, just content and positioning.

The File Structure

A PDF file has four major sections that appear in order: the header, the body, the cross-reference table, and the trailer. (The authoritative description is Adobe's PDF Reference 1.7 — the same document that became ISO 32000-1.)

The header is just one line. It identifies the file as a PDF and declares the version:

%PDF-1.7

The second line is conventionally a comment (it starts with %) containing at least four bytes with values of 128 or above. This hints to transfer tools that the file is binary, not plain text, so they don't apply line-ending conversion to it.
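As a rough illustration, the header check can be sketched in Python. The function name parse_pdf_header and the sample bytes are made up for this example, and the sketch assumes the %PDF- marker sits at the very start of the data, which real readers do not always require.

```python
import re

def parse_pdf_header(data: bytes) -> tuple[str, bool]:
    """Return (version, has_binary_marker) from the first two lines of a PDF."""
    lines = data.splitlines()
    match = re.match(rb"%PDF-(\d\.\d)", lines[0]) if lines else None
    if match is None:
        raise ValueError("not a PDF header")
    version = match.group(1).decode("ascii")
    # Binary marker: a comment line holding at least four bytes >= 128.
    second = lines[1] if len(lines) > 1 else b""
    has_marker = second.startswith(b"%") and sum(b >= 128 for b in second) >= 4
    return version, has_marker

sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(parse_pdf_header(sample))  # ('1.7', True)
```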

The body contains all the objects that make up the document — pages, images, fonts, annotations, metadata. Each object has a unique object number and generation number:

12 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj

Object 12, generation 0, is a Page object. It references its parent (object 2) and declares its dimensions (612×792 points, which is US Letter at 72 points per inch).
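To make the reading of that object concrete, here is a small Python sketch that pulls the object number, generation, and page size out of the same bytes. It is a regex toy under stated assumptions: the object is uncompressed, the /MediaBox is written inline, and the names OBJ_RE, MEDIABOX_RE and describe_page_object are invented for the example. Real parsers tokenize the file instead.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]"
)

def describe_page_object(raw: bytes) -> dict:
    """Extract object number, generation and page size (in inches)."""
    obj = OBJ_RE.search(raw)
    box = MEDIABOX_RE.search(obj.group(3)) if obj else None
    if obj is None or box is None:
        raise ValueError("no object with an inline /MediaBox found")
    llx, lly, urx, ury = (float(v) for v in box.groups())
    return {
        "number": int(obj.group(1)),
        "generation": int(obj.group(2)),
        "width_in": (urx - llx) / 72,   # 72 points per inch
        "height_in": (ury - lly) / 72,
    }

sample = b"12 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n"
print(describe_page_object(sample))
# {'number': 12, 'generation': 0, 'width_in': 8.5, 'height_in': 11.0}
```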

The cross-reference table (xref) is an index. It maps each object number to its byte offset in the file, enabling random access — you can jump directly to any object without scanning the entire file:

xref
0 13
0000000000 65535 f
0000000009 00000 n
...
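Each entry in a classic xref table is a 10-digit byte offset, a 5-digit generation number, and a flag: n for an object in use, f for a free entry. A minimal Python sketch of reading such a table follows. The function name parse_xref_section and the three-entry sample are assumptions made for illustration, and PDF 1.5+ cross-reference streams are not handled.

```python
def parse_xref_section(text: str) -> dict[int, tuple[int, int, bool]]:
    """Map object number -> (offset or next free object, generation, in_use)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected an 'xref' keyword")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then number of entries.
        first, count = (int(v) for v in lines[i].split())
        for n, line in enumerate(lines[i + 1 : i + 1 + count]):
            offset, generation, kind = line.split()
            entries[first + n] = (int(offset), int(generation), kind == "n")
        i += 1 + count
    return entries

# Hypothetical three-entry table, shorter than the 13-entry one above.
sample = """xref
0 3
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
"""
table = parse_xref_section(sample)
print(table[1])  # (9, 0, True)
```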

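The trailer dictionary names the document's root object (the catalog), which is the entry point for navigating the document tree, and usually the /Info dictionary as well. The startxref line that follows gives the byte offset of the xref table, so a reader starts at the end of the file and works backwards: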

trailer
<< /Size 13 /Root 1 0 R /Info 11 0 R >>
startxref
8492
%%EOF
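Putting the four sections together, the sketch below writes a one-page blank PDF by hand and then locates the xref table the way a reader would, starting from the end of the file. It is an illustrative toy, not a production writer: build_minimal_pdf and find_xref_offset are invented names, and the page has no content stream or fonts.

```python
def build_minimal_pdf() -> bytes:
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")   # header + binary marker
    offsets = []
    for number, body in enumerate(objects, start=1):    # body
        offsets.append(len(out))                        # remember byte offset
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_at = len(out)                                  # cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # entry 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset             # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)

def find_xref_offset(data: bytes) -> int:
    """Read the number after the last 'startxref' near the end of the file."""
    tail = data[-1024:]
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found")
    return int(tail[marker + len(b"startxref"):].split()[0])

pdf = build_minimal_pdf()
offset = find_xref_offset(pdf)
print(pdf[offset:offset + 4])  # b'xref'
```

The key design point is that offsets are recorded while writing, which is why hand-editing a PDF in a text editor tends to break it: any inserted byte shifts every later offset.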

Content Streams and Operators

The actual visible content on a page — text, lines, images — lives in content streams. A content stream is a sequence of PDF operators, which are short keywords, with their operands. Think of it as a very low-level drawing instruction set:

BT                        % Begin text object
/F1 12 Tf                 % Use font F1 at 12pt
100 700 Td                % Move text position: 100 right, 700 up from line start
(Hello, PDF World) Tj     % Show text string
ET                        % End text object
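The operator sequence above can be generated from code. The Python sketch below builds the same five lines and escapes the characters that are special inside a PDF literal string (backslash and parentheses). The helper names are hypothetical, and it assumes plain ASCII text in a font already registered as F1 in the page's resources.

```python
def fmt(value: float) -> str:
    """Format a number as plain decimal (PDF has no exponent notation)."""
    return f"{value:.4f}".rstrip("0").rstrip(".")

def escape_pdf_string(text: str) -> str:
    # Backslash first, so the escapes added for parentheses are not doubled.
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def text_stream(text: str, x: float, y: float, font: str = "F1", size: float = 12) -> str:
    return "\n".join([
        "BT",                                  # begin text object
        f"/{font} {fmt(size)} Tf",             # select font and size
        f"{fmt(x)} {fmt(y)} Td",               # move to the text position
        f"({escape_pdf_string(text)}) Tj",     # show the string
        "ET",                                  # end text object
    ])

print(text_stream("Hello, PDF World", 100, 700))
```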

The coordinate system has its origin at the bottom-left corner of the page, with y increasing upward. This is the PostScript/mathematical convention, opposite to what most screen rendering systems use. When you're working with PDF coordinates programmatically, this trips people up — y=700 on a Letter page (792pt tall) puts you 700 points from the bottom, or about 92 points from the top.
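The flip is a one-line subtraction, sketched here in Python. The function names are made up for the example; the only fact relied on is the Letter page height of 792 points.

```python
LETTER_HEIGHT = 792.0  # points

def top_to_pdf_y(offset_from_top: float, page_height: float = LETTER_HEIGHT) -> float:
    """Convert a distance from the top edge into a PDF y coordinate."""
    return page_height - offset_from_top

def box_to_pdf_rect(left, top, width, height, page_height=LETTER_HEIGHT):
    """Top-left style box (x, y, w, h) -> PDF rectangle (llx, lly, urx, ury)."""
    lower = page_height - top - height   # the bottom edge, measured from the bottom
    return (left, lower, left + width, lower + height)

print(top_to_pdf_y(92))  # 700.0
```

Note that for a box, the height has to be subtracted as well: PDF rectangles are anchored at their lower-left corner, not their upper-left one.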

Distances are measured in points: 1 point = 1/72 of an inch. A US Letter page is 612×792pt. An A4 page is 595×842pt.
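The page sizes quoted above fall out of simple unit arithmetic, shown here as a small Python sketch (the helper names are illustrative). A4 is defined as 210×297 mm, so its size in points is not a whole number and is usually rounded.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4

def inches_to_points(inches: float) -> float:
    return inches * POINTS_PER_INCH

def mm_to_points(mm: float) -> float:
    return mm / MM_PER_INCH * POINTS_PER_INCH

print(inches_to_points(8.5), inches_to_points(11.0))         # 612.0 792.0
print(round(mm_to_points(210)), round(mm_to_points(297)))    # 595 842
```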

How Fonts Are Embedded

Font embedding is what makes PDFs portable. When a font is embedded, the font data travels with the file — the viewer doesn't need that font installed. The tradeoff is file size, but it's the right tradeoff for document exchange.

PDF supports several font formats:

Type 1 — Adobe's original outline font format, still common in older PDFs
TrueType — the format most Windows and older Mac fonts use
OpenType — the modern standard, which can contain either TrueType or CFF (Compact Font Format) outlines
CIDFont — for large character sets (Asian scripts), where character codes map into a glyph dictionary rather than a simple encoding vector

Full embedding stores the entire font. Subsetting stores only the glyphs actually used in the document and prepends a tag of six uppercase letters and a plus sign to the font name (e.g. ABCDEF+Inter) to mark it as a subset. A report that only uses ASCII can have a subsetted font under 20KB even for a complex typeface.
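When inspecting a PDF's font list, the subset tag is easy to detect: six uppercase letters followed by a plus sign. A minimal Python sketch, with split_subset_tag as a hypothetical helper name:

```python
import re
from typing import Optional, Tuple

SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_tag(base_font: str) -> Tuple[Optional[str], str]:
    """Return (tag, font_name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font

print(split_subset_tag("ABCDEF+Inter"))  # ('ABCDEF', 'Inter')
print(split_subset_tag("Helvetica"))     # (None, 'Helvetica')
```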

Fonts that are not embedded rely on the viewer having that font installed. When it doesn't, the viewer substitutes — and the layout shifts, often badly. This is why "embed all fonts" is standard practice before sharing PDFs.

Image Streams

Images in PDF are also stored as streams, typically compressed. PDF supports several image compression filters natively: JPEG (DCTDecode), ZIP/deflate (FlateDecode), CCITT Group 4 (CCITTFaxDecode, for monochrome scans), JBIG2 (JBIG2Decode), and JPEG 2000 (JPXDecode). When you insert a JPEG into a PDF, it's usually stored as-is with DCTDecode, so there is no re-compression and no quality loss.
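FlateDecode is the same zlib/deflate compression that Python's zlib module produces, so a lossless image stream can be sketched with the standard library alone. The function name flate_image_xobject is an assumption for this example, and the sketch only covers raw 8-bit RGB samples with no predictor.

```python
import zlib

def flate_image_xobject(pixels: bytes, width: int, height: int) -> bytes:
    """Wrap raw 8-bit RGB samples as a FlateDecode image XObject body."""
    if len(pixels) != width * height * 3:
        raise ValueError("expected 3 bytes per pixel")
    compressed = zlib.compress(pixels)
    header = (
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceRGB /BitsPerComponent 8 "
        b"/Filter /FlateDecode /Length %d >>" % (width, height, len(compressed))
    )
    # /Length counts only the bytes between 'stream' and 'endstream'.
    return header + b"\nstream\n" + compressed + b"\nendstream"

red_2x2 = bytes([255, 0, 0]) * 4          # four red pixels
print(flate_image_xobject(red_2x2, 2, 2)[:60])
```

A JPEG would instead be written byte-for-byte into the stream with /Filter /DCTDecode, which is why inserting photos as JPEG keeps files small.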

Images are referenced from the page's content stream as XObjects:

/Im1 Do  % Draw image XObject named Im1

Their position and scale come from the current transformation matrix, set before the Do operator.
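An image XObject is drawn into a 1×1 unit square, so the matrix has to scale it up to the desired size and move it into place. A common idiom wraps this in q and Q so the matrix change does not leak into later drawing. The Python sketch below generates that operator sequence; place_image and fmt are illustrative names, and rotation is not handled.

```python
def fmt(value: float) -> str:
    """Format a number as plain decimal (PDF has no exponent notation)."""
    return f"{value:.4f}".rstrip("0").rstrip(".")

def place_image(name: str, x: float, y: float, width: float, height: float) -> str:
    """Content-stream fragment drawing an image with its lower-left corner at (x, y)."""
    # cm operands are [a b c d e f]: scale by (width, height), translate by (x, y).
    matrix = " ".join(fmt(v) for v in (width, 0, 0, height, x, y))
    return f"q {matrix} cm /{name} Do Q"

print(place_image("Im1", 72, 600, 200, 150))
# q 200 0 0 150 72 600 cm /Im1 Do Q
```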

PDF Versions and Linearization

PDF went through versions 1.0 (1993) to 1.7 (2006) under Adobe. Version 1.7 was standardized as ISO 32000-1 in 2008, and ISO has maintained the format since: PDF 2.0 is published as ISO 32000-2 (first edition 2017, revised 2020). Each version added features: forms (1.2), digital signatures (1.3), transparency (1.4), optional content layers (1.5). Most tools target 1.4 or 1.7 for broad compatibility.

Linearized PDFs (also called "fast web view") rearrange the file structure so a viewer can begin displaying the first page before the entire file is downloaded. The first page's objects appear near the beginning of the file, and a special linearization hint table helps the viewer request the right byte ranges. The building blocks are the same (header, objects, cross-reference data, trailer), but they are reordered for streaming access, with an extra cross-reference section for the first page placed near the start of the file.

You can tell if a PDF is linearized by the presence of /Linearized 1 in the first object, within the first 1024 bytes of the file. Many PDF generators have an "Optimize for web" option that enables this.
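That check is simple enough to sketch in Python. The function name looks_linearized and the sample dictionary values are made up for illustration. Treat a positive result as a hint only: a file that was linearized and later saved with an incremental update still carries the key but no longer benefits from it.

```python
import re

def looks_linearized(data: bytes, window: int = 1024) -> bool:
    """True if a /Linearized key appears in the first `window` bytes."""
    return re.search(rb"/Linearized\s+\d", data[:window]) is not None

# Hypothetical linearization dictionary; the numbers are placeholders.
sample = (
    b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
    b"1 0 obj\n<< /Linearized 1 /L 48213 /O 4 /E 9120 /N 2 /T 47990 "
    b"/H [ 500 120 ] >>\nendobj\n"
)
print(looks_linearized(sample))                                # True
print(looks_linearized(b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"))  # False
```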

For merging or compressing existing PDFs without needing to understand the internals, PDF Merge and PDF Compress handle the heavy lifting. To go deeper on what compression does to PDF file structure, the follow-up post How PDF Compression Works covers stream-level optimization and re-encoding strategies. And for a practical comparison of when to use PDF at all, PDF vs DOCX covers the workflow decision.

FAQ

Why does PDF use coordinates from the bottom-left instead of top-left like the web?

PDF inherited the convention from PostScript, which follows the mathematical one: origin at bottom-left, y increasing upward. Most screen rendering systems (HTML/CSS, Canvas, SVG) put the origin at top-left instead, because text fills the page top-to-bottom and the convention matches reading order. When working with PDF programmatically, you need to convert: a top-edge offset of N points becomes pageHeight - N in PDF coordinates.

Should I generate PDFs server-side or client-side?

Depends on data sensitivity and complexity. Client-side (jsPDF, pdfmake, pdf-lib in browser) keeps user data local and avoids server load, which suits invoices, certificates, and simple reports. Server-side (Puppeteer/headless Chrome, Playwright, wkhtmltopdf, Prince) handles complex layouts and CSS reliably, which suits HTML-to-PDF conversion of styled content. Where exact placement matters, native PDF libraries (pdf-lib, PDFKit) give you control over every coordinate, at the cost of doing the layout by hand.

What's the difference between PDF/A and regular PDF?

PDF/A is an ISO archival standard (ISO 19005) requiring full font embedding, no external dependencies, no encryption, and a fixed render appearance. PDF/A-1, PDF/A-2, and PDF/A-3 are successive parts of the standard rather than quality levels. PDF/A-3 allows embedded files of any type (useful for invoices that include source XML). Use PDF/A for legal, archival, or compliance documents that must render identically in 50 years. Regular PDF is fine for everyday sharing where future-proofing isn't critical.

Why are my PDFs so much larger than the source images?

Usually because images are being re-encoded with lossless compression instead of staying as JPEG. Most PDF generators automatically use JPEG (DCTDecode) for photos and FlateDecode for diagrams, but some hand-rolled generators don't. Check your PDF size against the sum of source image sizes; if it's much bigger, your library is likely re-encoding losslessly. Ghostscript can downsample and re-encode images at lower quality. qpdf only does lossless restructuring and stream recompression, so it won't shrink image data by much.

What's a linearized PDF and why does it matter?

A linearized PDF is structured for incremental download: the viewer can start showing the first page before the rest of the file arrives. Viewers that support it, Acrobat among them, benefit most on large documents served over HTTP with byte-range requests; viewers that ignore linearization simply read the file as a normal PDF. To linearize, look for an "Optimize for web view" option in your PDF tool, or run qpdf --linearize input.pdf output.pdf. For PDFs delivered via the web (forms, manuals, reports), linearizing is a cheap UX win.

Can PDFs really preserve text searchability after scanning?

Only if OCR has been applied. A raw scan is just images of pages with no text data. PDFs with searchable text from scans contain both the image (visible) and an invisible text layer aligned with it (selectable, searchable). This is usually called a "searchable PDF" or "sandwich PDF"; it is independent of PDF/A, although the two are often combined for archiving. Tools like Adobe Acrobat, OCRmyPDF (which uses Tesseract), and ABBYY FineReader produce these. Without OCR, even a scanned page with crisp type can't be searched.

Why do PDFs sometimes show as "blocked" by enterprise security?

Enterprise security tools (Microsoft Defender ATP, Symantec, etc.) inspect PDFs for embedded JavaScript, suspicious URI actions, and known exploit patterns. PDFs can contain interactive JavaScript via the /JavaScript action — historically a malware vector. Some enterprises block all PDFs with embedded scripts. To avoid: don't embed JavaScript in PDFs you generate, and use static forms (AcroForms with submit-via-URL) instead of script-driven ones.
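Before sending a generated PDF through such a filter, a crude self-check is to count the name tokens those scanners look for. The Python sketch below does only that, under loud assumptions: the list of names is a small illustrative selection, count_risky_names is an invented helper, and a raw byte scan misses anything inside compressed object streams or written with #xx hex escapes. It is not a substitute for a real scanner.

```python
import re

# Illustrative selection only; real scanners check many more things.
RISKY_NAMES = (b"/JavaScript", b"/JS", b"/URI", b"/OpenAction")

def count_risky_names(data: bytes) -> dict[str, int]:
    """Count occurrences of each name token in the raw bytes."""
    counts = {}
    for name in RISKY_NAMES:
        # The name must end here, so /JS does not also match /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts

sample = (
    b"1 0 obj\n<< /S /JavaScript /JS (app.alert\\(1\\)) >>\nendobj\n"
    b"2 0 obj\n<< /S /URI /URI (https://example.com) >>\nendobj\n"
)
print(count_risky_names(sample))
# {'/JavaScript': 1, '/JS': 1, '/URI': 2, '/OpenAction': 0}
```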

How do I read PDF metadata programmatically?

PDF metadata lives in the /Info dictionary referenced from the trailer, or in an XMP metadata stream referenced from the document catalog's /Metadata entry. Standard /Info fields: /Title, /Author, /Subject, /Keywords, /Creator, /Producer, /CreationDate, /ModDate. In Node, pdf-lib exposes them via doc.getTitle(), doc.getAuthor(), and so on. In Python, pypdf (the successor to PyPDF2) has reader.metadata. Note: many PDF generators leave default values like "Microsoft Word" or "TCPDF Generator" in the Creator or Producer fields, so check before publishing.
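The raw /CreationDate and /ModDate values are strings in PDF's own date format, D:YYYYMMDDHHmmSS followed by an optional UTC offset such as +01'00' or Z. If your library hands you the raw string, a parser along these lines can turn it into a datetime. This is a sketch under assumptions: parse_pdf_date is a hypothetical helper, the sample date is invented, and writers that deviate from the format are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string; everything after the year is optional."""
    m = PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date: {value!r}")
    # Missing fields default to January 1st, 00:00:00.
    year, month, day, hour, minute, second = (
        int(g) if g else default
        for g, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    tz = None                                   # naive if no offset is given
    if m.group(7):
        tz = timezone.utc
    elif m.group(8):
        sign = 1 if m.group(8) == "+" else -1
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(sign * offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

print(parse_pdf_date("D:20240115093000+01'00'").isoformat())
# 2024-01-15T09:30:00+01:00
```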