How to OCR a scanned PDF into searchable, selectable text

Maya Sundaram · May 30, 2026 · 7 min read

The short version

A scanned PDF is an image of text with no real characters — OCR adds an invisible, searchable text layer.
Accuracy is set by the scan: aim for 300 DPI, straight pages, and good contrast, and tell the OCR engine the document's language.
OCR is the prerequisite for converting, searching, and reliably redacting scanned documents.
Scans are often your most sensitive files — don't run them through free third-party cloud OCR.

You have a scanned PDF, you press Ctrl+F to find a name, and nothing happens. The document looks like text, but to the computer it's a photograph — millions of pixels that happen to form letter shapes, with no actual letters anywhere. OCR (optical character recognition) is the step that reads those pixels and writes real, searchable text back into the file. It's the difference between a picture of a document and a document.

What OCR actually does

An OCR engine looks at the image of each page, recognizes the shapes as characters and words, and adds an invisible text layer underneath the picture. The page still looks identical — same scan, same coffee stain — but now there's selectable, searchable text sitting behind the image, aligned to where the words appear. That's why a well-OCR'd scan lets you highlight a sentence even though you're technically clicking on a photo.
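
One way to picture the text layer is as a list of recognized words, each paired with the rectangle it occupies on the page image. The sketch below is a simplified mental model, not how Arthize or the PDF format stores the layer internally. `Word`, `word_at`, `search`, and the pixel coordinates are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """One recognized word plus the box it covers on the page image (pixels)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def word_at(layer: list[Word], x: int, y: int) -> Optional[Word]:
    """Return the word whose box contains the point (x, y), if any."""
    for word in layer:
        if word.contains(x, y):
            return word
    return None


def search(layer: list[Word], term: str) -> list[Word]:
    """Case-insensitive search over the hidden words."""
    needle = term.lower()
    return [w for w in layer if needle in w.text.lower()]


# Hypothetical layer for one line of a scanned page.
layer = [
    Word("Invoice", 120, 200, 310, 240),
    Word("No.", 325, 200, 390, 240),
    Word("4417", 405, 200, 500, 240),
]

clicked = word_at(layer, 150, 220)  # a click on the picture of "Invoice"
hits = search(layer, "invoice")     # a Ctrl+F style lookup
```

Clicking or searching never touches the pixels: both operate on the hidden words, and the boxes tell the viewer where to draw the highlight. In an actual PDF, OCR tools typically store these words as text drawn in an invisible rendering mode, positioned over the image.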

How to OCR a scanned PDF

1. Confirm you actually need it: try to select text. If you can't, it's a scan and OCR will help.
2. Open the OCR tool and upload the scanned PDF.
3. Pick the document's language(s) — this dramatically improves accuracy.
4. Run it, then test: search for a word you can see on the page. It should jump straight to it.
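
The first and last steps above (checking whether a file needs OCR, then verifying the result) can be scripted once you have the text a PDF library extracts from each page. This is a minimal sketch that assumes you already hold that text as `page_texts`, one string per page. `needs_ocr`, `pages_containing`, and the `min_chars` threshold are hypothetical, not part of any Arthize API.

```python
def needs_ocr(page_texts: list[str], min_chars: int = 20) -> bool:
    """Treat the file as a scan if most pages have almost no extractable text.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse > len(page_texts) / 2


def pages_containing(page_texts: list[str], word: str) -> list[int]:
    """Return 1-based page numbers where the word appears (case-insensitive)."""
    needle = word.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Hypothetical extraction results for the same file before and after OCR.
before = ["", "  ", ""]
after = [
    "Lease agreement between the parties named below.",
    "The tenant, Jordan Alvarez, agrees to the following terms.",
    "Signed and dated by both parties.",
]

scan_detected = needs_ocr(before)                    # nothing to select yet
found_on = pages_containing(after, "alvarez")        # the search test after OCR
```

Running `pages_containing` with a word you can see on a page is the scripted equivalent of the Ctrl+F test: an empty result after OCR means the text layer is missing or the word was misread.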

What makes OCR accurate (or not)

OCR quality is mostly decided before you ever run it, by the quality of the scan:

Resolution. 300 DPI is the sweet spot. Below ~200 DPI, the engine starts guessing and "rn" becomes "m."
Straightness. Skewed or rotated pages tank accuracy. De-skewing first helps a lot.
Contrast. Crisp black text on white scans best. Faded photocopies and colored backgrounds are hard.
Language. Telling the engine the right language (and any accents) prevents a whole class of errors.
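
Resolution is the easiest of these factors to check with arithmetic: DPI is the pixel width of the page image divided by the physical width in inches. The sketch below uses the rough 200 and 300 DPI marks from this guide. The function names and the "rescan", "usable", and "good" labels are illustrative assumptions, not an established scale.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution of a scan: pixels across divided by physical inches across."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def rate_resolution(dpi: float) -> str:
    """Bucket a DPI value using the rough 200 and 300 DPI marks from this guide."""
    if dpi < 200:
        return "rescan"   # below roughly 200 DPI the engine starts guessing
    if dpi < 300:
        return "usable"   # workable, but short of the sweet spot
    return "good"


# A US Letter page is 8.5 inches wide.
letter_full = effective_dpi(2550, 8.5)   # 2550 px across 8.5 in
letter_half = effective_dpi(1275, 8.5)   # the same page at half the pixel width
```

A Letter-size page that is 2550 pixels wide works out to 300 DPI, while the same page at 1275 pixels is only 150 DPI and lands in the range where rescanning is the better fix.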

Modern engines (Arthize uses a Tesseract-based pipeline under the hood) are genuinely good on clean scans, often well above 98% character accuracy, but they are not magic on a crumpled fax. Garbage in, garbage out still rules.
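
This guide doesn't define how character accuracy is measured. One common definition, assumed here, is one minus the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The sketch below implements that definition with a standard Levenshtein distance. It is an illustration, not Arthize's evaluation code.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a recognized character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters the OCR output got right, from 0.0 to 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# The classic "rn" read as "m" confusion costs two edits in a six-letter word.
errors = edit_distance("modern", "modem")
accuracy = character_accuracy("modern", "modem")
```

Under this definition, 98% accuracy still means about two wrong characters in every hundred, which is why a final search test on real names and numbers is worth doing.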

Why OCR is the unlock step for everything else

OCR isn't just about Ctrl+F. It's the prerequisite that makes other tools work on scans:

Conversion

Trying to convert a scanned PDF to Word without OCR gives you a Word file with a picture pasted in it. OCR first, and you get editable text.

Redaction and search

You can't reliably redact what you can't search. OCR lets you find every instance of a name across a 200-page scanned discovery dump instead of eyeballing each page.
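
Once the text layer exists, finding every instance of a name becomes a plain text search. The sketch below assumes the OCR'd text has already been extracted into one string per page. `find_name` and the sample pages are hypothetical. It treats any whitespace inside the name as flexible, because OCR output often wraps a name across two lines.

```python
import re


def find_name(page_texts: list[str], name: str) -> list[tuple[int, int, str]]:
    """Return (page_number, character_offset, matched_text) for every hit."""
    parts = [re.escape(part) for part in name.split()]
    if not parts:
        return []
    # Any run of spaces or line breaks may separate the parts of the name.
    pattern = re.compile(r"\s+".join(parts), re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for match in pattern.finditer(text):
            hits.append((page_number, match.start(), match.group()))
    return hits


# Hypothetical OCR output for three pages.
pages = [
    "Deposition of Dana Whitfield, taken on the record.",
    "No relevant names appear on this page.",
    "Counsel then asked DANA\nWHITFIELD to confirm the date.",
]
hits = find_name(pages, "Dana Whitfield")
pages_to_redact = sorted({page for page, _, _ in hits})
```

Exact matching like this misses misread characters, such as "rn" recognized as "m", so on poor scans treat the hit list as a starting point and still spot-check the pages.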

Smaller, smarter files

Once a scan has a text layer, search and copy no longer depend on the pixels, so you have more room to compress the page images without losing access to the content.

Scans are often the most sensitive documents

Think about what gets scanned: passports, tax forms, signed contracts, medical records. These are exactly the documents you least want sitting on a free OCR site's server. And OCR is computationally heavy, so a lot of free tools lean on third-party cloud APIs, which means your passport scan may pass through yet another company you've never heard of. We run OCR inside the same private workspace as the rest of Arthize, with no third-party hop. More on that calculus in what happens when you upload a PDF to a free tool.

OCR sits in the "convert" leg of the PDF workflow guide — but really it's the gate that the conversion, redaction, and search tools all depend on for scanned documents.

Frequently asked

What does OCR do to a PDF?

OCR reads the page images, recognizes the characters, and adds an invisible text layer behind the image. The page looks identical but the text becomes selectable and searchable.

What resolution is best for OCR?

Around 300 DPI is the sweet spot. Below roughly 200 DPI accuracy drops fast, and skewed or low-contrast scans also hurt results.

Do I need OCR to convert a scanned PDF to Word?

Yes. Without OCR, converting a scan to Word just embeds the page image. OCR first so there's real text to convert.

Maya Sundaram

Co-founder & document-tooling engineer, Arthize

Maya has spent the last decade building document-processing systems — first for a legal-tech startup that ingested millions of scanned filings, now at Arthize where she owns the conversion, OCR and compression pipelines. She has opinions about Ghostscript flags.