I kept paying for tokens and uploading documents to cloud APIs just to pull text
out of screenshots and scanned PDFs, so I wrapped macOS's built-in Vision
framework as an MCP server. Any MCP client (Claude Desktop/Code, Cursor) can
call it to OCR images and PDFs, read QR/barcodes, detect faces, find document
corners, and classify images — entirely on-device.
Two things made it worth packaging. First, nothing leaves the Mac: no API keys, no network calls. Second, sending extracted text instead of raw page images cut token usage by ~97% on the documents I tested. Apple Vision is also often more accurate than a vision model on dense text. OCR returns paragraphs in reading order, each with a bounding box and a confidence score, so the model can rebuild Markdown/HTML.
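To make the "rebuild Markdown" part concrete, here is a minimal sketch of what a client could do with reading-order paragraphs. The `OcrParagraph` shape, its field names, and the 0.5 confidence cutoff are assumptions made up for illustration; the server's actual output schema may differ.

```typescript
// Hypothetical shape of one OCR paragraph. Field names are assumptions,
// not the server's documented schema.
interface OcrParagraph {
  text: string;
  confidence: number; // assumed to be in the range 0..1
  boundingBox: { x: number; y: number; width: number; height: number };
}

// Turn paragraphs (already in reading order) into Markdown:
// drop empty or low-confidence paragraphs, collapse internal whitespace
// and line breaks, and separate paragraphs with a blank line.
function paragraphsToMarkdown(
  paragraphs: OcrParagraph[],
  minConfidence = 0.5, // arbitrary cutoff chosen for this sketch
): string {
  return paragraphs
    .filter((p) => p.confidence >= minConfidence && p.text.trim().length > 0)
    .map((p) => p.text.trim().replace(/\s+/g, " "))
    .join("\n\n");
}

// Made-up sample data, only to show the call.
const sample: OcrParagraph[] = [
  { text: "Invoice   42", confidence: 0.98, boundingBox: { x: 0.1, y: 0.9, width: 0.3, height: 0.05 } },
  { text: "smudge", confidence: 0.2, boundingBox: { x: 0.7, y: 0.8, width: 0.1, height: 0.02 } },
  { text: "Total due:\n10 EUR", confidence: 0.91, boundingBox: { x: 0.1, y: 0.5, width: 0.4, height: 0.05 } },
];

console.log(paragraphsToMarkdown(sample));
```

The bounding boxes are carried along but unused here; a fuller version could use them to guess headings or table columns.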
It's a small native Swift helper behind a Node MCP server. Limitations: macOS-only, and quality is whatever Apple Vision gives you.
Install: npx -y macos-vision-mcp
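For clients that read an `mcpServers` map (Claude Desktop's `claude_desktop_config.json` is one), the entry would look roughly like what the snippet below prints. The server label "macos-vision" is an arbitrary name picked for this sketch, and other clients may expect a different file or layout.

```typescript
// Sketch of an MCP client config entry. "macos-vision" is just a label
// chosen for this example; only the command and args come from the
// install line above.
const config = {
  mcpServers: {
    "macos-vision": {
      command: "npx",
      args: ["-y", "macos-vision-mcp"],
    },
  },
};

// Print JSON that can be merged into the client's config file.
console.log(JSON.stringify(config, null, 2));
```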
Happy to answer questions about the structured output or how the token measurement was done.