# PDF Streamer — Large Document Processing for AI Workflows
**TL;DR:** A Claude Code skill that processes large PDFs (30+ pages) into clean markdown without blowing context windows. It streams page-by-page, persists to disk, handles column layouts, detects tables, flags scanned pages for vision transcription, and resumes from crashes. Output: one clean `output.md` plus per-page markdown files.
## The Problem It Solves
Two issues hit the moment PDFs get long:

- **Context blowout** — A 200-page report won’t fit in a single Claude turn. Even if it does, you pay tokens for content you don’t need.
- **Layout fragility** — Text layers in real PDFs are uneven: clean on body text, broken on tables, missing on scanned inserts. Treating the whole document as one blob hides which pages are good and which need help.
PDF Streamer solves both by streaming: pages process one at a time, each gets its own markdown file, and a manifest tracks state so a crash on page 247 doesn’t restart from zero.
## How It Works

### Architecture
```
PDF
├─ triage  → manifest.json       (per-page metadata)
├─ extract → pages/page-NNN.md   (text-first, with vision fallback)
└─ rollup  → output.md           (stripped headers/footers, stitched paragraphs)
```

Each stage is a Python script. Claude orchestrates and only steps in where judgment is needed. The pipeline creates a workdir next to the PDF:
```
book.workdir/
├── manifest.json       # Per-page state tracking
├── pages/              # Per-page markdown
│   ├── page-001.md
│   ├── page-002.md
│   └── ...
├── vision/             # PNG renders for flagged pages
│   └── page-NNN.png
├── vision_queue.json   # Pages needing manual transcription
└── output.md           # Final rolled-up document
```

### Key Capabilities
| Capability | How It Works |
|---|---|
| Column-aware reading | Detects two-column layouts, reads left-column-then-right instead of zigzag |
| Table extraction | Uses pdfplumber to detect and render tables as markdown |
| Heading detection | Clusters font sizes — larger text becomes #/##/### |
| Header/footer stripping | Detects lines repeated on >50% of pages, strips them |
| Vision fallback | Flags pages with empty/broken text layers, renders PNG for Claude vision |
| Resumability | Manifest tracks status per page — crashes resume, don’t restart |
| Paragraph stitching | Rollup joins text split across page boundaries |
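The heading-detection capability can be illustrated with a minimal font-size clustering sketch. This is an assumption-level illustration; the helper name `heading_levels` and the rounding heuristic are mine, not the skill's actual code:

```python
from collections import Counter

def heading_levels(spans, max_levels=3):
    """Map font sizes to markdown heading levels.

    spans: list of (font_size, text) pairs, e.g. from a PDF text
    extractor. The most common size is treated as body text; larger
    sizes become '#', '##', '###' by descending rank.
    Illustrative sketch only, not the skill's exact algorithm.
    """
    sizes = Counter(round(size) for size, _ in spans)
    body = sizes.most_common(1)[0][0]  # dominant size = body text
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    level = {s: "#" * min(i + 1, max_levels) for i, s in enumerate(larger)}

    lines = []
    for size, text in spans:
        prefix = level.get(round(size))
        lines.append(f"{prefix} {text}" if prefix else text)
    return lines
```

Real extractors report fractional sizes per span, so production code would cluster nearby sizes rather than exact-match them.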
## When to Use It

**Good fit:**
- Long-form PDFs (reports, books, manuals, contracts — 30+ pages)
- Native digital PDFs with good text layers
- Documents you need to reference across multiple Claude sessions
- Wiki source material (turn PDFs into citable markdown)
**Not a good fit:**

- Short PDFs (<10 pages) — just read them directly
- Mostly-scanned PDFs — run OCR first (`ocrmypdf`)
- Interactive forms — use a different tool
- Q&A over a document — this produces markdown, not answers
## Real-World Test Results

### Test 1: Anthropic Skills Guide (33 pages)
- 33/33 pages extracted, 0 failures
- Two-column layouts handled correctly
- Tables rendered as markdown
- Output: ~37k chars, 1.9k lines
### Test 2: Breakthrough Advertising (239 pages)

A stress test on a real book: 1966 typography, a single-column body, and embedded vintage ad reproductions.
- 239/239 pages extracted, 0 failures
- 10 pages flagged for vision (all were embedded images — exactly the intended fallback case)
- ~67 seconds pipeline time (text-first)
- After rollup: ~71,500 words, 4,605 lines
The Schwartz test forced two fixes that the smaller test didn’t reveal:
- Threshold tuning for chapter running headers
- Stem-normalization for headers with embedded page numbers
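The stem-normalization fix can be sketched as follows; the function name, digit-stripping rule, and 50% threshold here are an illustrative reconstruction of the idea, not the skill's actual code:

```python
import re
from collections import Counter

def repeated_header_stems(first_lines, threshold=0.5):
    """Find running headers that repeat across pages.

    first_lines: the first text line of each page. Digits are stripped
    before counting, so 'BREAKTHROUGH ADVERTISING 41' and
    'BREAKTHROUGH ADVERTISING 42' share one stem and still register
    as a repeated header despite the embedded page number.
    """
    def stem(line):
        return re.sub(r"\d+", "", line).strip().lower()

    counts = Counter(stem(line) for line in first_lines)
    total = len(first_lines)
    return {s for s, c in counts.items() if c / total > threshold}
```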
## Workflow

### Step 1: Run the Pipeline

```shell
python3 scripts/pipeline.py /path/to/document.pdf
```

Options:

- `--workdir /some/path` — Override the default workdir location
- `--start N --end M` — Process only a page range (preview before committing)
- `--force` — Reprocess already-extracted pages
### Step 2: Handle Vision Queue

After the pipeline completes, check whether `vision_queue.json` exists:

```shell
cat document.workdir/vision_queue.json
```

If it exists, pages need manual transcription:
1. Read the PNG from `vision/page-NNN.png`
2. Transcribe it to markdown (Claude vision works well)
3. Write the result to `pages/page-NNN.md`
4. Mark the page extracted: `python3 scripts/pipeline.py /path/to/pdf --mark-extracted N`
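A small helper can enumerate what is still pending, assuming `vision_queue.json` is a flat JSON list of page numbers (a hypothetical schema; the skill's actual format may differ):

```python
import json
from pathlib import Path

def pending_vision_pages(workdir):
    """Return queued page numbers that still lack a transcription.

    A page counts as done once pages/page-NNN.md exists and is
    non-empty. Assumes vision_queue.json is a flat list of ints.
    """
    queue_path = Path(workdir) / "vision_queue.json"
    if not queue_path.exists():
        return []
    queue = json.loads(queue_path.read_text())
    pending = []
    for n in queue:
        page_md = Path(workdir) / "pages" / f"page-{n:03d}.md"
        if not page_md.exists() or not page_md.read_text().strip():
            pending.append(n)
    return pending
```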
### Step 3: Roll Up

```shell
python3 scripts/rollup.py document.workdir/
```

This produces the final `output.md` with:
- Repeated headers/footers stripped
- Page-boundary paragraphs stitched
- Page anchors (`<!-- page 17 -->`) preserved for reference
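Page-boundary stitching can be sketched with a simple heuristic. The rule shown here (no terminal punctuation on one side, lowercase start on the other) is an assumption about how such stitching typically works, not the skill's exact logic:

```python
def stitch_pages(pages):
    """Join page texts, merging paragraphs split across a page break.

    If a page ends without terminal punctuation and the next page
    starts lowercase, the break is treated as mid-paragraph and the
    two pieces are joined with a space; otherwise a paragraph break
    is kept. Illustrative heuristic only.
    """
    out = pages[0].rstrip()
    for page in pages[1:]:
        nxt = page.strip()
        if out and out[-1] not in ".!?:\"'" and nxt[:1].islower():
            out = out + " " + nxt      # continue the split paragraph
        else:
            out = out + "\n\n" + nxt   # genuine paragraph boundary
    return out
```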
### Step 4: Verify

Spot-check `output.md`:
- Sample 3-5 random pages by anchor
- Confirm headings are at sensible levels
- Verify tables look like tables
- Check for garbled text
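The random sampling step can be semi-automated by splitting on the page anchors. This helper is hypothetical (name and approach are mine), built on the `<!-- page N -->` anchor format described above:

```python
import random
import re

def sample_pages(output_md, k=4, seed=0):
    """Return k random (page_number, excerpt) pairs for spot-checking.

    Splits output.md on '<!-- page N -->' anchors; excerpts are
    truncated to 200 characters. Seeded for reproducible checks.
    """
    parts = re.split(r"<!-- page (\d+) -->", output_md)
    pages = list(zip(parts[1::2], parts[2::2]))  # (number, text) pairs
    random.seed(seed)
    picks = random.sample(pages, min(k, len(pages)))
    return [(int(n), text.strip()[:200]) for n, text in picks]
```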
## Integration with Wiki Workflows

PDF Streamer is designed as a feeder skill for wiki content:

```
Long PDF → pdf-streamer → output.md → wiki articles
```

**Use case: research ingestion**
1. Download an industry report or whitepaper
2. Run pdf-streamer to get clean markdown
3. Extract key insights into wiki pages
4. Cite the source with page references
The Schwartz Breakthrough Advertising test was exactly this — turning a classic marketing book into quotable wiki content. See glossary/awareness-levels for the result.
## Technical Details

### Dependencies

```shell
pip install pymupdf pdfplumber
```

- **pymupdf** — Text extraction, page rendering, layout info
- **pdfplumber** — Table detection (pymupdf’s table support is weaker)
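pdfplumber's `extract_tables()` returns each table as a list of rows, with cells as strings or `None`. Rendering one as markdown might look like this sketch (the helper name is mine):

```python
def table_to_markdown(rows):
    """Render a pdfplumber-style table as a markdown table.

    rows: list of rows; each cell is a string or None. The first row
    is treated as the header. Newlines inside cells are flattened.
    """
    def fmt(row):
        return "| " + " | ".join((cell or "").replace("\n", " ") for cell in row) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator] + [fmt(row) for row in body])
```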
### What’s Robust vs. Speculative

**Tested and working:**
- Per-page text extraction with column-aware reading order
- Heading detection from font size clusters
- Table extraction via pdfplumber
- Manifest-driven resumability
- Vision fallback flagging
- Header/footer detection via cross-page repetition
- Cross-page paragraph stitching
**Not yet stress-tested:**
- 3-column academic layouts
- Borderless or zebra-striped tables
- PDFs with rotation, embedded forms, or annotations
- Heavily multi-equation academic papers
## Installation
The skill lives in a Claude Code skills directory. To activate:
```shell
ln -s /path/to/pdf-streamer ~/.claude/skills/pdf-streamer
```

Verify with `ls -la ~/.claude/skills/`. Restart Claude Code for auto-discovery.
## Key Takeaways
- Streams, doesn’t load — Never puts entire PDF in context
- Resumable — Crash on page 247, resume from 247
- Vision fallback — Flags problem pages for manual transcription
- Wiki feeder — Turns long PDFs into citable markdown
- Column-aware — Handles two-column layouts correctly
- Header stripping — Removes repetitive headers/footers automatically
## Related
- glossary/llm-wiki-pattern — How this skill fits into wiki maintenance
- tools/claude-skills — Understanding Claude Code skills
- methodology — How this wiki is built and maintained
## Sources
- Primores Experiment 07 — PDF Streamer skill development
- Stress tested on Breakthrough Advertising by Eugene Schwartz (239 pages)
- Tested on Anthropic’s Complete Guide to Building Skills for Claude (33 pages)