PDF Streamer — Large Document Processing for AI Workflows

TL;DR: A Claude Code skill that processes large PDFs (30+ pages) into clean markdown without blowing context windows. Streams page-by-page, persists to disk, handles column layouts, detects tables, flags scanned pages for vision transcription, and resumes from crashes. Output: one clean output.md plus per-page markdown files.

The Problem It Solves

Two issues hit the moment PDFs get long:

  1. Context blowout — A 200-page report won’t fit in a single Claude turn. Even if it does, you pay tokens for content you don’t need.

  2. Layout fragility — Text layers in real PDFs are uneven. Clean on body text, broken on tables, missing on scanned inserts. Treating the whole document as one blob hides which pages are good and which need help.

PDF Streamer solves both by streaming: pages process one at a time, each gets its own markdown file, and a manifest tracks state so a crash on page 247 doesn’t restart from zero.


How It Works

Architecture

PDF
├─ triage → manifest.json (per-page metadata)
├─ extract → pages/page-NNN.md (text-first, with vision fallback)
└─ rollup → output.md (stripped headers/footers, stitched paragraphs)

Each stage is a Python script. Claude orchestrates and only steps in where judgment is needed. The pipeline creates a workdir next to the PDF:

book.workdir/
├── manifest.json # Per-page state tracking
├── pages/ # Per-page markdown
│ ├── page-001.md
│ ├── page-002.md
│ └── ...
├── vision/ # PNG renders for flagged pages
│ └── page-NNN.png
├── vision_queue.json # Pages needing manual transcription
└── output.md # Final rolled-up document
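The resume logic driven by manifest.json can be sketched in a few lines. The field names here (`pages`, `status`, `extracted`) are assumptions for illustration; the real manifest schema may differ:

```python
import json
from pathlib import Path

# Hypothetical manifest schema: {"pages": {"1": {"status": "extracted"}, ...}}.
# The real manifest.json fields may differ; this only sketches the resume logic.
def pending_pages(workdir: str) -> list[int]:
    """Return page numbers still needing extraction, so a rerun skips done work."""
    manifest = json.loads(Path(workdir, "manifest.json").read_text())
    return sorted(
        int(num) for num, meta in manifest["pages"].items()
        if meta.get("status") != "extracted"
    )
```

A crash mid-run leaves already-extracted pages marked in the manifest, so the next invocation only processes what `pending_pages` returns.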

Key Capabilities

| Capability | How it works |
| --- | --- |
| Column-aware reading | Detects two-column layouts and reads left column then right, instead of zigzagging across the page |
| Table extraction | Uses pdfplumber to detect tables and render them as markdown |
| Heading detection | Clusters font sizes; larger text becomes `#`/`##`/`###` |
| Header/footer stripping | Detects lines repeated on >50% of pages and strips them |
| Vision fallback | Flags pages with empty or broken text layers, renders a PNG for Claude vision |
| Resumability | Manifest tracks per-page status; crashes resume instead of restarting |
| Paragraph stitching | Rollup joins text split across page boundaries |
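The >50% repetition rule for header/footer stripping, combined with stem-normalization so embedded page numbers don't break matching, can be sketched as follows (a minimal reading of the approach, not the skill's actual implementation):

```python
import re
from collections import Counter

def repeated_lines(pages: list[str], threshold: float = 0.5) -> set[str]:
    """Find header/footer candidates: line 'stems' (digits stripped, so page
    numbers don't defeat matching) appearing on more than `threshold` of pages."""
    stem = lambda line: re.sub(r"\d+", "", line).strip().lower()
    counts = Counter()
    for page in pages:
        lines = page.splitlines()
        # Only the first and last couple of lines are header/footer candidates.
        for s in {stem(l) for l in (lines[:2] + lines[-2:]) if l.strip()}:
            counts[s] += 1
    return {s for s, n in counts.items() if n > threshold * len(pages)}
```

A running header like `Chapter One  12` / `Chapter One  13` normalizes to the same stem on every page, crosses the threshold, and gets stripped; ordinary body text does not.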

When to Use It

Good fit:

  • Long-form PDFs (reports, books, manuals, contracts — 30+ pages)
  • Native digital PDFs with good text layers
  • Documents you need to reference across multiple Claude sessions
  • Wiki source material (turn PDFs into citable markdown)

Not a good fit:

  • Short PDFs (<10 pages) — just read them directly
  • Mostly-scanned PDFs — run OCR first (ocrmypdf)
  • Interactive forms — use a different tool
  • Q&A over a document — this produces markdown, not answers

Real-World Test Results

Test 1: Anthropic Skills Guide (33 pages)

  • 33/33 pages extracted, 0 failures
  • Two-column layouts handled correctly
  • Tables rendered as markdown
  • Output: ~37k chars, 1.9k lines

Test 2: Breakthrough Advertising (239 pages)

A stress test on a real book with 1966-era typography, a single-column body, and embedded vintage ad reproductions:

  • 239/239 pages extracted, 0 failures
  • 10 pages flagged for vision (all were embedded images — exactly the intended fallback case)
  • ~67 seconds pipeline time (text-first)
  • After rollup: ~71,500 words, 4,605 lines

The Schwartz test forced two fixes that the smaller test didn’t reveal:

  • Threshold tuning for chapter running headers
  • Stem-normalization for headers with embedded page numbers

Workflow

Step 1: Run the Pipeline

python3 scripts/pipeline.py /path/to/document.pdf

Options:

  • --workdir /some/path — Override default workdir location
  • --start N --end M — Process page range only (preview before committing)
  • --force — Reprocess already-extracted pages

Step 2: Handle Vision Queue

After pipeline completes, check if vision_queue.json exists:

cat document.workdir/vision_queue.json

If it exists, pages need manual transcription:

  1. Read the PNG from vision/page-NNN.png
  2. Transcribe to markdown (Claude vision works well)
  3. Write to pages/page-NNN.md
  4. Mark extracted: python3 scripts/pipeline.py /path/to/pdf --mark-extracted N
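The transcription loop above can be scripted around the queue file. The queue format assumed here (a flat JSON list of page numbers) is a guess; check the real vision_queue.json before relying on it:

```python
import json
from pathlib import Path

# Hypothetical queue format: a JSON list of page numbers, e.g. [12, 47].
# The real vision_queue.json layout may differ.
def vision_tasks(workdir: str) -> list[tuple[Path, Path]]:
    """Pair each flagged page's PNG render with the markdown path its
    transcription should be written to."""
    root = Path(workdir)
    queue_file = root / "vision_queue.json"
    if not queue_file.exists():
        return []  # nothing flagged; text extraction covered every page
    pages = json.loads(queue_file.read_text())
    return [(root / "vision" / f"page-{n:03d}.png",
             root / "pages" / f"page-{n:03d}.md") for n in pages]
```

Claude reads each PNG path, writes the transcription to the paired markdown path, then marks the page extracted via the pipeline flag.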

Step 3: Roll Up

python3 scripts/rollup.py document.workdir/

This produces the final output.md with:

  • Repeated headers/footers stripped
  • Page-boundary paragraphs stitched
  • Page anchors (<!-- page 17 -->) preserved for reference
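One plausible sketch of the paragraph stitching: if a page ends without terminal punctuation, the next page's text continues the same paragraph (page anchors omitted here for brevity; this is an illustration, not the rollup script itself):

```python
def stitch(pages: list[str]) -> str:
    """Concatenate page texts, rejoining paragraphs split across page
    boundaries: a page ending mid-sentence is glued to the next page."""
    out = pages[0].rstrip()
    for page in pages[1:]:
        nxt = page.lstrip()
        if out and out[-1] not in ".!?:\"'":
            out += " " + nxt       # mid-sentence break: glue with a space
        else:
            out += "\n\n" + nxt    # clean break: keep paragraphs separate
    return out
```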

Step 4: Verify

Spot-check output.md:

  • Sample 3-5 random pages by anchor
  • Confirm headings are at sensible levels
  • Verify tables look like tables
  • Check for garbled text
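Sampling by anchor is easy to script, since the anchors follow a fixed `<!-- page N -->` pattern (assumed from the rollup description above):

```python
import random
import re

def sample_pages(markdown: str, k: int = 3) -> dict[int, str]:
    """Split output.md on its page anchors and return k random pages
    for manual spot-checking."""
    parts = re.split(r"<!-- page (\d+) -->", markdown)
    # With a capturing group, re.split yields [before, num, text, num, text, ...]
    pages = {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}
    chosen = random.sample(sorted(pages), k=min(k, len(pages)))
    return {n: pages[n] for n in chosen}
```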

Integration with Wiki Workflows

PDF Streamer is designed as a feeder skill for wiki content:

Long PDF → pdf-streamer → output.md → wiki articles

Use case: Research ingestion

  1. Download industry report or whitepaper
  2. Run pdf-streamer to get clean markdown
  3. Extract key insights into wiki pages
  4. Cite the source with page references

The Schwartz Breakthrough Advertising test was exactly this — turning a classic marketing book into quotable wiki content. See glossary/awareness-levels for the result.


Technical Details

Dependencies

pip install pymupdf pdfplumber

  • pymupdf — Text extraction, page rendering, layout info
  • pdfplumber — Table detection (pymupdf’s table support is weaker)

What’s Robust vs. Speculative

Tested and working:

  • Per-page text extraction with column-aware reading order
  • Heading detection from font size clusters
  • Table extraction via pdfplumber
  • Manifest-driven resumability
  • Vision fallback flagging
  • Header/footer detection via cross-page repetition
  • Cross-page paragraph stitching

Not yet stress-tested:

  • 3-column academic layouts
  • Borderless or zebra-striped tables
  • PDFs with rotation, embedded forms, or annotations
  • Heavily multi-equation academic papers

Installation

The skill lives in a Claude Code skills directory. To activate:

ln -s /path/to/pdf-streamer ~/.claude/skills/pdf-streamer

Verify with ls -la ~/.claude/skills/. Restart Claude Code for auto-discovery.


Key Takeaways

  • Streams, doesn’t load — Never puts entire PDF in context
  • Resumable — Crash on page 247, resume from 247
  • Vision fallback — Flags problem pages for manual transcription
  • Wiki feeder — Turns long PDFs into citable markdown
  • Column-aware — Handles two-column layouts correctly
  • Header stripping — Removes repetitive headers/footers automatically


Sources

  • Primores Experiment 07 — PDF Streamer skill development
  • Stress tested on Breakthrough Advertising by Eugene Schwartz (239 pages)
  • Tested on Anthropic’s Complete Guide to Building Skills for Claude (33 pages)