Agentic PDF triage
Start with cheap native text and per-page quality signals, then decide whether to render, OCR, search, or crop.
Extract text, layout, visual regions, OCR, metadata, warnings, and rendered page images so agents can inspect PDF evidence instead of trusting a single flattened text stream.
Most PDF extraction tools give an agent a single string and ask it to trust the result. That breaks down on real documents: research papers with two columns, slides where meaning sits in shapes, reports with charts and tables, government forms with widget fields, scanned pages with OCR residue, and multilingual PDFs whose text layer contains compatibility glyphs or mojibake.
pdfvision is built around a different loop:
That loop is closer to how a human reads a PDF. You skim the page, notice when the visual page and extracted text disagree, zoom into a chart or form field, and keep the original evidence available for verification.
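The triage step in that loop can be sketched as a small decision function. This is an illustration only: the `PageSignals` fields, the `decideNextStep` name, and every threshold below are assumptions made for the example, not pdfvision's actual output schema or API. Search is left out because it is driven by a query rather than by page quality.

```typescript
// Hypothetical per-page signals. Field names are illustrative and are NOT
// taken from pdfvision's real JSON output.
interface PageSignals {
  page: number;
  nativeTextChars: number;      // characters found in the PDF text layer
  suspectGlyphRatio: number;    // share of replacement or mojibake glyphs, 0..1
  imageCoverage: number;        // fraction of page area covered by images, 0..1
  hasVectorGraphics: boolean;   // charts or shapes drawn as vector paths
}

type NextStep = "use-text" | "ocr" | "render" | "crop";

// All thresholds are placeholder assumptions; tune them on real documents.
function decideNextStep(s: PageSignals): NextStep {
  // Almost no native text on an image-heavy page suggests a scan.
  if (s.nativeTextChars < 50 && s.imageCoverage > 0.5) return "ocr";
  // A text layer full of suspect glyphs cannot be trusted; look at pixels.
  if (s.suspectGlyphRatio > 0.05) return "render";
  // Text looks fine, but meaning may sit in a chart or shape: crop that area.
  if (s.hasVectorGraphics || s.imageCoverage > 0.2) return "crop";
  // Cheap native text is good enough.
  return "use-text";
}
```

The checks run from cheapest failure mode to most specific, so a page only reaches rendering or cropping when its text layer gives a reason to doubt it.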
pdfvision combines the PDF signals an agent needs in one CLI and TypeScript library:
Run pdfvision without installing it:
npx pdfvision document.pdf
Render pages for a multimodal model:
npx pdfvision document.pdf --render
Extract structured JSON from a URL:
npx pdfvision --remote https://raw.githubusercontent.com/mozilla/pdf.js-sample-files/master/tracemonkey.pdf --format json
Search for evidence, then crop only the matching area:
npx pdfvision report.pdf --search "revenue" --json
npx pdfvision report.pdf --pages 3 --render --render-region 120,180,360,140 --render-output ./crops --json
Inspect visual structure without rendering every full page:
npx pdfvision slides.pdf --layout --image-boxes --vector-boxes --visual-regions --json
npx pdfvision slides.pdf --render-visual-regions --render-output ./regions --json
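An agent that drives the search-then-crop commands from a script may find it convenient to build the argument lists in code. The sketch below only assembles the flags already shown above; the helper names `searchArgs` and `cropArgs` and the `Region` type are hypothetical, and the four region numbers are passed through in the order given without assuming what each one means.

```typescript
// Four numbers in whatever order --render-region expects; this sketch does
// not assume their meaning and passes them through unchanged.
type Region = [number, number, number, number];

// Arguments for: npx pdfvision <file> --search <query> --json
function searchArgs(file: string, query: string): string[] {
  return ["pdfvision", file, "--search", query, "--json"];
}

// Arguments for rendering one region of one page into an output directory.
function cropArgs(file: string, page: number, region: Region, outDir: string): string[] {
  return [
    "pdfvision", file,
    "--pages", String(page),
    "--render",
    "--render-region", region.join(","),
    "--render-output", outDir,
    "--json",
  ];
}

// Reproduces the crop command from the example above (without the leading "npx").
const args = cropArgs("report.pdf", 3, [120, 180, 360, 140], "./crops");
console.log(["npx", ...args].join(" "));
```

Keeping the arguments as an array rather than one string means they can be handed to Node's `execFile("npx", args)` from `node:child_process` without shell quoting, which matters when the search query contains spaces or quotes. How the JSON output is structured is not covered here, so any parsing code should treat it as unknown until checked against real output.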