A PDF you export from Word or InDesign is rarely the smallest it can be. The file often contains uncompressed image data, fonts with thousands of unused glyphs, and redundant internal objects. PDF compression strips or recompresses that content — sometimes achieving a 10x reduction, sometimes only 5%, depending on what's inside. Here's what actually happens under the hood.
Why PDFs Get Large
What makes files big in the first place:
Uncompressed or over-quality images are the biggest culprit by far. A single high-resolution photo embedded in a PDF can be several megabytes on its own. Word processors and design tools often embed images at the full original resolution, even if the document is displayed at a fraction of that size. A 24MP camera photo printed at 2 inches on a page doesn't need 24 megapixels of data.
Unsubsetted fonts add significant overhead. When you embed a complete font file, you're including the outlines for every glyph the font contains — potentially thousands of characters for a Unicode font, even if your document only uses 40 of them. A single unsubsetted OpenType font can be 200–600KB.
Redundant objects accumulate when PDFs are incrementally updated. The PDF format allows appending new content without rewriting the whole file. After multiple save operations, the file can contain multiple versions of the same object — earlier revisions that are no longer referenced but still take up space.
Uncompressed content streams round out the list. Content streams are the page description data — the PostScript-like commands that describe where to draw text and graphics. They're usually FLATE-compressed, but sometimes aren't, particularly in PDFs produced by older or minimal-compliance tools.
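The image-resolution mismatch is easy to quantify. A rough sketch in Python, assuming a 3:2-aspect 24 MP photo (6000 x 4000 px) placed 2 inches wide and printed at 300 DPI:

```python
# How many pixels does a printed image actually need?
def pixels_needed(width_in, height_in, dpi=300):
    """Pixel count required to render at the given physical size and DPI."""
    return round(width_in * dpi) * round(height_in * dpi)

embedded = 6000 * 4000                           # 24,000,000 px as embedded
needed = pixels_needed(2.0, 2.0 * 4000 / 6000)   # 2 in wide, 3:2 aspect

print(embedded // needed)  # the embed carries ~100x more data than the print needs
```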
Image Compression Inside PDFs
Images in PDFs use a filter chain — one or more compression algorithms applied to the raw pixel data. The choice of filter depends on the image content.
DCT (JPEG) compression is used for photographs and continuous-tone images. It's a lossy algorithm — it discards high-frequency detail that human vision is less sensitive to. Quality can be tuned from nearly invisible loss to heavy artifacting. Most PDF optimizers target 72–150 DPI for screen-resolution images and 150–300 DPI for print, then apply JPEG at quality 60–80. A raw 5MB photo can become 200KB this way.
FLATE (ZIP/Deflate) compression is the lossless option, used for graphics, diagrams, screenshots, and images with text. It's the same algorithm as ZIP files — it finds repeating patterns in the data and encodes them more efficiently. FLATE is the right choice when you can't afford any quality loss, like a logo or a chart with precise colors.
JBIG2 is specialized for black-and-white (1-bit) images — scanned text pages, fax output. It achieves dramatically better compression than FLATE for bitonal content by recognizing repeating symbol patterns (the same letter e appears many times on a page) and storing each symbol once. PDF/A-2 and later standards allow JBIG2; PDF/A-1 does not. The bitstream format is defined in ITU-T T.88 (the JBIG2 specification).
CCITT Group 4 is an older standard for bilevel images, used in many scanned document workflows. It's lossless and efficient for text scans, but JBIG2 achieves significantly better ratios on most real-world documents.
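The difference between FLATE-friendly and FLATE-hostile data is easy to see with Python's zlib, which implements the same Deflate algorithm as PDF's FlateDecode filter. A minimal sketch:

```python
import os
import zlib

# Flat-color graphics (long runs of identical bytes) are what FLATE
# is built for; photographic noise defeats it, which is why photos
# get DCT (JPEG) instead.
flat_graphic = bytes([255, 0, 0]) * 100_000   # 300,000 bytes of solid red
noise = os.urandom(300_000)                   # stand-in for photo-like data

print(len(zlib.compress(flat_graphic, 9)))  # collapses to a tiny fraction of 300,000
print(len(zlib.compress(noise, 9)))         # stays near 300,000 bytes
```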
Font Subsetting
Font subsetting is one of the highest-impact optimizations for text-heavy PDFs. Instead of embedding the full font file, subsetting embeds only the glyphs actually used in the document.
If your document uses a font to display the word "Hello World", a subsetted version of that font contains outlines for H, e, l, o, W, r, d, plus the space — just those eight glyphs. A typical English business document uses maybe 80–100 distinct characters out of a font that might contain 1,200.
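The glyph bookkeeping is simple to sketch (a real subsetter, such as fontTools' subset module, also tracks composite-glyph components and character-map entries):

```python
# Which glyphs does a subset font actually need for this text?
text = "Hello World"
used = sorted(set(text))

print(used)       # [' ', 'H', 'W', 'd', 'e', 'l', 'o', 'r']
print(len(used))  # 8 glyphs instead of the font's full repertoire
```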
PDF generators like Acrobat, InDesign, and LaTeX subset fonts by default. The problem arises with older tools, some Word-to-PDF converters, and PDFs that have been combined from multiple sources. A merged PDF can end up with several copies of the same font family, each slightly different, none subsetted.
Subsetting is always lossless. The visual output is identical — the glyph shapes are the same, just fewer of them. The only downside is that if you need to edit the PDF later, the font may not have the glyphs you're trying to add.
Object Streams and Cross-Reference Streams
PDF 1.5 introduced two structural improvements that reduce file size without touching content.
Object streams pack multiple PDF objects into a single compressed stream. A PDF contains many small objects — page dictionaries, resource lists, metadata records. Individually they're tiny, but the overhead of storing each as a separate top-level object adds up. Grouping them into object streams and compressing the whole group achieves better compression than compressing each object alone.
Cross-reference streams replace the traditional xref table (a plain text index of byte offsets for every object in the file) with a compressed binary stream. On a document with thousands of objects, the xref table can be substantial. The compressed stream format is both smaller and more efficient to parse.
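Why grouping helps can be sketched with zlib: many tiny objects compressed individually pay the per-stream overhead repeatedly and never share compression context, while one combined stream exploits the repetition across objects. (The page-dictionary strings below are illustrative, not a real file.)

```python
import zlib

# 100 near-identical page dictionaries, like a real PDF's small objects.
pages = [
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents %d 0 R >>" % n
    for n in range(10, 110)
]

individually = sum(len(zlib.compress(p, 9)) for p in pages)  # one stream each
grouped = len(zlib.compress(b"".join(pages), 9))             # one object stream

print(individually, grouped)  # grouped comes out many times smaller
```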
Tools that produce PDF 1.4 and earlier output miss both of these. Optimizers that target PDF 1.5+ rewrite the file structure to take advantage of them.
Lossless vs Lossy: The Real Trade-off
PDF optimization tools typically offer two modes, and the difference matters depending on your use case.
Lossless optimization — Remove duplicate objects, recompress content streams with better settings, subset fonts, apply FLATE to uncompressed graphics, upgrade the file structure to use object/xref streams. Nothing visual changes. This usually achieves 10–30% size reduction on typical documents.
Lossy optimization — Downsample images to a lower resolution or recompress them at lower quality. This is where the big reductions happen. Downsampling a 300 DPI image to 150 DPI throws away 75% of the pixels. Recompressing JPEG images that were stored losslessly (as FLATE in the PDF) to JPEG at quality 75 can reduce image size by 80%. But the changes are irreversible — if you later need the high-resolution version, you have to go back to the source.
The practical rule: use lossless for archival documents, contracts, and anything that might be reprinted. Use lossy for email attachments, web PDFs, and documents that will only ever be viewed on screen at normal zoom.
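The downsampling arithmetic behind those numbers, as a quick sketch — pixel count scales with the square of the resolution:

```python
def pixels_kept(dpi_from, dpi_to):
    """Fraction of pixels that survive downsampling between two DPIs."""
    return (dpi_to / dpi_from) ** 2

print(pixels_kept(300, 150))  # 0.25 -- 75% of the pixels are discarded
print(pixels_kept(300, 72))   # ~0.058 -- over 94% discarded
```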
What Ghostscript Actually Does
Ghostscript, the most common open-source PDF processing engine, exposes its compression presets through the -dPDFSETTINGS option: /screen, /ebook, /printer, and /prepress. The /screen preset downsamples color images to 72 DPI and applies aggressive JPEG compression. /ebook targets 150 DPI. /printer keeps 300 DPI but still recompresses images and subsets fonts. (Ghostscript's documentation covers each preset's exact knobs.)
Running Ghostscript's /ebook preset on a 10MB Word-to-PDF export is a common quick win:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=compressed.pdf input.pdf
For multi-document workflows, merging first and then compressing the combined file is more efficient — the compressor can deduplicate fonts and resources across all the input files at once. See PDF Merge for combining files before running the optimization step.
Realistic Size Reduction Expectations
PDF compression isn't magic. Results depend entirely on what's inside the file.
Image-heavy PDFs — a brochure full of photographs — can shrink by 60–85% with lossy optimization. Lossless-only gets you maybe 5–15%.
Text-only PDFs — contracts, reports, forms — are already mostly compressed. Lossless optimization (font subsetting does most of the work here) can get you 10–20%; lossy image settings add little on top, since there are few images to degrade.
Scanned documents — the big variable. An uncompressed scan at 300 DPI is huge. Recompressing to JBIG2 (for black-and-white text) or JPEG can reduce a 15MB scan to under 1MB with acceptable visual quality.
Already-optimized PDFs — if someone already ran the file through Acrobat's optimizer or Ghostscript, you might get 2–5% additional savings at best. Adobe's own PDF optimization guide lists all the knobs Acrobat exposes for the same operations.
The PDF Compress tool handles the most impactful operations — image resampling, FLATE compression for graphics, and object stream optimization — without requiring you to install Ghostscript locally.
For a deeper understanding of the PDF format itself, including how content streams and the cross-reference table are structured, see How PDF Works. If you're curious about the compression algorithms PDF borrows from general data compression, How Data Compression Works covers the underlying theory.
FAQ
Why does the same PDF compress better in Acrobat vs Ghostscript?
Different optimizers make different trade-offs. Acrobat's "Reduced Size" uses smart defaults targeting average use, often producing 30–50% reduction with minimal quality loss. Ghostscript's /ebook preset is more aggressive on images (150 DPI cap, JPEG quality 75) and produces smaller files but with more visible quality drop on images. For maximum reduction with explicit control, use Ghostscript with custom parameters; for "just make it smaller without thinking," Acrobat works.
What's a realistic compression target for a 10MB PDF?
Depends on content. For an image-heavy brochure: typically 1–3MB after lossy compression (60–85% reduction). For a text-heavy report: typically 7–9MB after lossless optimization (10–20% reduction). For a scanned document: highly variable — JBIG2 can take a 15MB B&W scan to under 1MB. If you're not seeing meaningful reduction, the PDF is probably already optimized; further compression hits visible quality.
Should I downsample images to 72 DPI for web PDFs?
For screen-only PDFs (manuals, web brochures), yes — 72 DPI is enough for normal-zoom viewing on most displays. For PDFs that might be printed, use 150 DPI minimum (enough for acceptable output on a typical office printer). For high-quality print, keep 300 DPI. Mobile/Retina displays sometimes benefit from 150 DPI even for screen-only viewing because users zoom in. The right number is "the resolution at which the user would view it, multiplied by 1.5 for safety."
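That closing rule of thumb as a sketch — the 1.5x safety factor and the baseline viewing resolutions are this article's rule of thumb, not a standard:

```python
def target_dpi(viewing_dpi, safety=1.5):
    """Image DPI to keep: viewing resolution times a zoom safety margin."""
    return round(viewing_dpi * safety)

print(target_dpi(72))   # 108 -- classic screen baseline
print(target_dpi(96))   # 144 -- typical desktop display
print(target_dpi(200))  # 300 -- effective close-reading print resolution
```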
Can I compress a password-protected PDF?
Not directly — most compression tools refuse encrypted input. You need to decrypt first (qpdf --decrypt --password=PWD encrypted.pdf decrypted.pdf), compress the decrypted version, then optionally re-encrypt. This works when you have the password; if you don't, you're stuck for a user password. Note the two password types: a user password (open password) actually encrypts the file and is required to decrypt it, while an owner password (permissions password) only restricts printing and editing, and many tools strip owner restrictions without needing the password at all.
Why do PDF/A files often compress less?
PDF/A's archival requirements forbid certain compression techniques: no JBIG2 or JPEG 2000 in PDF/A-1 (both are allowed from PDF/A-2 on), and no encryption at all. PDF/A also requires every font to be embedded — subsetting is permitted, but every glyph the document uses must be present — along with full ICC output profiles. The trade-off is intentional: PDF/A optimizes for long-term readability, not file size. If size matters more than archival compliance, use regular PDF.
How does JBIG2 actually compress so well?
JBIG2 recognizes that scanned text pages contain many copies of the same character — the letter "e" might appear hundreds of times. Instead of storing each "e" as separate pixels, JBIG2 stores one master "e" symbol and references it for each instance, with small offsets for slight rendering variations. This pattern-matching approach achieves 5–10x better compression than CCITT Group 4 on typical text scans. The trade-off: aggressive JBIG2 modes can substitute one character for another (the famous "Xerox scanner bug" where 6s got swapped for 8s).
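The symbol-dictionary idea can be sketched in a few lines; real JBIG2 (ITU-T T.88) matches similar bitmaps rather than identical ones, which is exactly where the Xerox substitution bug crept in. The glyph bitmaps here are placeholder strings, not real scan data:

```python
# A page's glyph occurrences: the same bitmap repeats constantly.
occurrences = ["e-bitmap"] * 200 + ["t-bitmap"] * 120 + ["x-bitmap"] * 5

symbol_dict = sorted(set(occurrences))              # each bitmap stored once
refs = [symbol_dict.index(g) for g in occurrences]  # tiny index per instance

print(len(occurrences), len(symbol_dict))  # 325 instances, 3 stored bitmaps
```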
Should I compress PDFs before merging or after?
After. Merging first lets the optimizer deduplicate fonts and shared resources across all input files in one pass. If you compress each input file separately and then merge, you get redundant fonts (the same Arial subset embedded once per input file) and miss cross-document deduplication opportunities. Pattern: merge → compress → encrypt (if needed) → distribute.
What's the difference between linearizing and compressing?
Linearizing rearranges the PDF's internal structure for incremental download (first page renders before full file arrives) without changing content size meaningfully. Compressing reduces file size by recompressing images, subsetting fonts, removing duplicates, and optimizing streams. They're separate operations often done together: qpdf --linearize --compress-streams=y --object-streams=generate input.pdf output.pdf. For web-served PDFs, linearizing improves perceived load time even when the file size is similar.