# PDF Streamer — Large Document Processing for AI Workflows
**TL;DR:** A Claude Code skill that processes large PDFs (30+ pages) into clean markdown without blowing context windows. It streams page-by-page, persists to disk, handles column layouts, detects tables, flags scanned pages for vision transcription, and resumes from crashes. Output: one clean `output.md` plus per-page markdown files.
## The Problem It Solves
Two issues hit the moment PDFs get long:

- **Context blowout** — A 200-page report won’t fit in a single Claude turn. Even if it does, you pay tokens for content you don’t need.
- **Layout fragility** — Text layers in real PDFs are uneven: clean on body text, broken on tables, missing on scanned inserts. Treating the whole document as one blob hides which pages are good and which need help.
PDF Streamer solves both by streaming: pages process one at a time, each gets its own markdown file, and a manifest tracks state so a crash on page 247 doesn’t restart from zero.
## How It Works

### Architecture
```
PDF
├─ triage  → manifest.json       (per-page metadata)
├─ extract → pages/page-NNN.md   (text-first, with vision fallback)
└─ rollup  → output.md           (stripped headers/footers, stitched paragraphs)
```

Each stage is a Python script. Claude orchestrates and only steps in where judgment is needed. The pipeline creates a workdir next to the PDF:
```
book.workdir/
├── manifest.json       # Per-page state tracking
├── pages/              # Per-page markdown
│   ├── page-001.md
│   ├── page-002.md
│   └── ...
├── vision/             # PNG renders for flagged pages
│   └── page-NNN.png
├── vision_queue.json   # Pages needing manual transcription
└── output.md           # Final rolled-up document
```

### Key Capabilities
| Capability | How It Works |
|---|---|
| Column-aware reading | Detects two-column layouts, reads left-column-then-right instead of zigzag |
| Table extraction | Uses pdfplumber to detect and render tables as markdown |
| Heading detection | Clusters font sizes — larger text becomes #/##/### |
| Header/footer stripping | Detects lines repeated on >50% of pages, strips them |
| Vision fallback | Flags pages with empty/broken text layers, renders PNG for Claude vision |
| Resumability | Manifest tracks status per page — crashes resume, don’t restart |
| Paragraph stitching | Rollup joins text split across page boundaries |
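The heading-detection capability can be illustrated with a minimal font-size clustering sketch. This is an assumption-level illustration; the helper name `heading_levels` and the rounding heuristic are mine, not the skill's actual code:

```python
from collections import Counter

def heading_levels(spans, max_levels=3):
    """Map font sizes to markdown heading levels.

    spans: list of (font_size, text) pairs, e.g. from a PDF text
    extractor. The most common size is treated as body text; larger
    sizes become '#', '##', '###' by descending rank.
    Illustrative sketch only, not the skill's exact algorithm.
    """
    sizes = Counter(round(size) for size, _ in spans)
    body = sizes.most_common(1)[0][0]  # dominant size = body text
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    level = {s: "#" * min(i + 1, max_levels) for i, s in enumerate(larger)}

    lines = []
    for size, text in spans:
        prefix = level.get(round(size))
        lines.append(f"{prefix} {text}" if prefix else text)
    return lines
```

Real extractors report fractional sizes per span, so production code would cluster nearby sizes rather than exact-match them.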
## When to Use It

**Good fit:**
- Long-form PDFs (reports, books, manuals, contracts — 30+ pages)
- Native digital PDFs with good text layers
- Documents you need to reference across multiple Claude sessions
- Wiki source material (turn PDFs into citable markdown)
**Not a good fit:**

- Short PDFs (<10 pages) — just read them directly
- Mostly-scanned PDFs — run OCR first (`ocrmypdf`)
- Interactive forms — use a different tool
- Q&A over a document — this produces markdown, not answers
## Real-World Test Results

### Test 1: Anthropic Skills Guide (33 pages)
- 33/33 pages extracted, 0 failures
- Two-column layouts handled correctly
- Tables rendered as markdown
- Output: ~37k chars, 1.9k lines
### Test 2: Breakthrough Advertising (239 pages)

A stress test on a real book: 1966 typography, a single-column body, and embedded vintage ad reproductions.
- 239/239 pages extracted, 0 failures
- 10 pages flagged for vision (all were embedded images — exactly the intended fallback case)
- ~67 seconds pipeline time (text-first)
- After rollup: ~71,500 words, 4,605 lines
The Schwartz test forced two fixes that the smaller test didn’t reveal:
- Threshold tuning for chapter running headers
- Stem-normalization for headers with embedded page numbers
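The stem-normalization fix can be sketched as follows; the function name, digit-stripping rule, and 50% threshold here are an illustrative reconstruction of the idea, not the skill's actual code:

```python
import re
from collections import Counter

def repeated_header_stems(first_lines, threshold=0.5):
    """Find running headers that repeat across pages.

    first_lines: the first text line of each page. Digits are stripped
    before counting, so 'BREAKTHROUGH ADVERTISING 41' and
    'BREAKTHROUGH ADVERTISING 42' share one stem and still register
    as a repeated header despite the embedded page number.
    """
    def stem(line):
        return re.sub(r"\d+", "", line).strip().lower()

    counts = Counter(stem(line) for line in first_lines)
    total = len(first_lines)
    return {s for s, c in counts.items() if c / total > threshold}
```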
## Workflow

### Step 1: Run the Pipeline

```shell
python3 scripts/pipeline.py /path/to/document.pdf
```

Options:

- `--workdir /some/path` — Override the default workdir location
- `--start N --end M` — Process only a page range (preview before committing)
- `--force` — Reprocess already-extracted pages
### Step 2: Handle Vision Queue

After the pipeline completes, check whether `vision_queue.json` exists:

```shell
cat document.workdir/vision_queue.json
```

If it exists, pages need manual transcription:
1. Read the PNG from `vision/page-NNN.png`
2. Transcribe it to markdown (Claude vision works well)
3. Write the result to `pages/page-NNN.md`
4. Mark the page extracted: `python3 scripts/pipeline.py /path/to/pdf --mark-extracted N`
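A small helper can enumerate what is still pending, assuming `vision_queue.json` is a flat JSON list of page numbers (a hypothetical schema; the skill's actual format may differ):

```python
import json
from pathlib import Path

def pending_vision_pages(workdir):
    """Return queued page numbers that still lack a transcription.

    A page counts as done once pages/page-NNN.md exists and is
    non-empty. Assumes vision_queue.json is a flat list of ints.
    """
    queue_path = Path(workdir) / "vision_queue.json"
    if not queue_path.exists():
        return []
    queue = json.loads(queue_path.read_text())
    pending = []
    for n in queue:
        page_md = Path(workdir) / "pages" / f"page-{n:03d}.md"
        if not page_md.exists() or not page_md.read_text().strip():
            pending.append(n)
    return pending
```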
### Step 3: Roll Up

```shell
python3 scripts/rollup.py document.workdir/
```

This produces the final `output.md` with:
- Repeated headers/footers stripped
- Page-boundary paragraphs stitched
- Page anchors (`<!-- page 17 -->`) preserved for reference
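Page-boundary stitching can be sketched with a simple heuristic. The rule shown here (no terminal punctuation on one side, lowercase start on the other) is an assumption about how such stitching typically works, not the skill's exact logic:

```python
def stitch_pages(pages):
    """Join page texts, merging paragraphs split across a page break.

    If a page ends without terminal punctuation and the next page
    starts lowercase, the break is treated as mid-paragraph and the
    two pieces are joined with a space; otherwise a paragraph break
    is kept. Illustrative heuristic only.
    """
    out = pages[0].rstrip()
    for page in pages[1:]:
        nxt = page.strip()
        if out and out[-1] not in ".!?:\"'" and nxt[:1].islower():
            out = out + " " + nxt      # continue the split paragraph
        else:
            out = out + "\n\n" + nxt   # genuine paragraph boundary
    return out
```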
### Step 4: Verify

Spot-check `output.md`:
- Sample 3-5 random pages by anchor
- Confirm headings are at sensible levels
- Verify tables look like tables
- Check for garbled text
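The random sampling step can be semi-automated by splitting on the page anchors. This helper is hypothetical (name and approach are mine), built on the `<!-- page N -->` anchor format described above:

```python
import random
import re

def sample_pages(output_md, k=4, seed=0):
    """Return k random (page_number, excerpt) pairs for spot-checking.

    Splits output.md on '<!-- page N -->' anchors; excerpts are
    truncated to 200 characters. Seeded for reproducible checks.
    """
    parts = re.split(r"<!-- page (\d+) -->", output_md)
    pages = list(zip(parts[1::2], parts[2::2]))  # (number, text) pairs
    random.seed(seed)
    picks = random.sample(pages, min(k, len(pages)))
    return [(int(n), text.strip()[:200]) for n, text in picks]
```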
## Integration with Wiki Workflows

PDF Streamer is designed as a feeder skill for wiki content:

```
Long PDF → pdf-streamer → output.md → wiki articles
```

**Use case: research ingestion**
1. Download an industry report or whitepaper
2. Run pdf-streamer to get clean markdown
3. Extract key insights into wiki pages
4. Cite the source with page references
The Schwartz Breakthrough Advertising test was exactly this — turning a classic marketing book into quotable wiki content. See glossary/awareness-levels for the result.
## Technical Details

### Dependencies

```shell
pip install pymupdf pdfplumber
```

- **pymupdf** — Text extraction, page rendering, layout info
- **pdfplumber** — Table detection (pymupdf’s table support is weaker)
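pdfplumber's `extract_tables()` returns each table as a list of rows, with cells as strings or `None`. Rendering one as markdown might look like this sketch (the helper name is mine):

```python
def table_to_markdown(rows):
    """Render a pdfplumber-style table as a markdown table.

    rows: list of rows; each cell is a string or None. The first row
    is treated as the header. Newlines inside cells are flattened.
    """
    def fmt(row):
        return "| " + " | ".join((cell or "").replace("\n", " ") for cell in row) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator] + [fmt(row) for row in body])
```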
### What’s Robust vs. Speculative

**Tested and working:**
- Per-page text extraction with column-aware reading order
- Heading detection from font size clusters
- Table extraction via pdfplumber
- Manifest-driven resumability
- Vision fallback flagging
- Header/footer detection via cross-page repetition
- Cross-page paragraph stitching
**Not yet stress-tested:**
- 3-column academic layouts
- Borderless or zebra-striped tables
- PDFs with rotation, embedded forms, or annotations
- Heavily multi-equation academic papers
## Installation
The skill lives in a Claude Code skills directory. To activate:
```shell
ln -s /path/to/pdf-streamer ~/.claude/skills/pdf-streamer
```

Verify with `ls -la ~/.claude/skills/`. Restart Claude Code for auto-discovery.
## Key Takeaways
- Streams, doesn’t load — Never puts entire PDF in context
- Resumable — Crash on page 247, resume from 247
- Vision fallback — Flags problem pages for manual transcription
- Wiki feeder — Turns long PDFs into citable markdown
- Column-aware — Handles two-column layouts correctly
- Header stripping — Removes repetitive headers/footers automatically
## Related
- glossary/llm-wiki-pattern — How this skill fits into wiki maintenance
- tools/claude-skills — Understanding Claude Code skills
- methodology — How this wiki is built and maintained
## Sources
- Primores Experiment 07 — PDF Streamer skill development
- Stress tested on Breakthrough Advertising by Eugene Schwartz (239 pages)
- Tested on Anthropic’s Complete Guide to Building Skills for Claude (33 pages)