
OCR Explained: Making Scanned PDFs Searchable

EssentialPDFTool Team · Editorial team, EssentialPDFTool · 9 min read

Written and fact-checked against our production PDF engine (pdf-lib, pdfjs-dist, and in-browser OCR where relevant). We ship the same code users run in the 24 EssentialPDFTool utilities—no paywalled “writer” edition.

You receive a scanned PDF — a contract photographed on a phone, a stack of invoices run through an office scanner, a textbook page captured in a hurry. You need to find a specific clause or copy a number. You try Ctrl+F. Nothing. You try to select text. The cursor turns into a crosshair for selecting images instead of text.

The document exists as a picture of text rather than actual text. OCR — Optical Character Recognition — is how you fix that.

What Is OCR, Actually?

Optical Character Recognition is the process of analyzing a raster image (pixels arranged in a grid) and identifying which characters those pixels represent. The output is a Unicode string of the recognized text.

When OCR is applied to a scanned PDF, the tool:

  1. Renders each page to a high-resolution pixel array
  2. Runs the OCR engine on each page image
  3. Generates an invisible text layer containing the recognized text
  4. Overlays this text layer on top of the original page image
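
The four steps above can be skeletoned as a loop. This is only a structural sketch: `ocrPdf`, `renderPage`, `recognize`, and `overlayText` are placeholder names, not a real API; in practice rendering would use pdfjs-dist, recognition Tesseract.js, and the overlay pdf-lib.

```javascript
// Skeleton of the OCR-a-scanned-PDF loop. Each step is injected as a stub;
// real implementations would wrap pdfjs-dist, Tesseract.js, and pdf-lib.
async function ocrPdf(pages, { renderPage, recognize, overlayText }) {
  const results = [];
  for (const page of pages) {
    const image = await renderPage(page);   // 1. render to a pixel array
    const words = await recognize(image);   // 2. run the OCR engine
    results.push(overlayText(page, words)); // 3+4. invisible text layer over the scan
  }
  return results;
}

// Tiny demo with stubbed steps:
ocrPdf([1, 2], {
  renderPage: async (p) => `image-${p}`,
  recognize: async (img) => [`word-from-${img}`],
  overlayText: (p, words) => ({ page: p, words }),
}).then((r) => console.log(r.length)); // 2
```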

The result is a PDF where you can:

  • Search with Ctrl+F / Cmd+F
  • Select and copy text
  • Have screen readers read the content aloud
  • Use document indexing tools that extract text

The visual appearance of the page doesn’t change — you still see the scanned image. The text layer is hidden behind it.

A Brief History of OCR Technology

OCR dates to the 1950s. Early systems were specialized hardware that could recognize only a single font. By the 1980s, software OCR (like OmniPage and FineReader) could handle multiple typefaces. By the 1990s, accuracy on clean, typed documents reached 99%.

The modern generation is neural-network-based. Tesseract, originally developed at HP Labs and open-sourced in 2005 (with development later sponsored by Google), switched to an LSTM (Long Short-Term Memory) neural network architecture in version 4.0 (2018). This dramatically improved accuracy on challenging inputs: varied fonts, slight skew, background noise, and multiple languages.

Google’s Cloud Vision API, Amazon Textract, and Microsoft Azure Computer Vision use deep learning models trained on billions of document images. For very clean documents, all modern engines achieve 99%+ character accuracy. The differences emerge on difficult inputs.

How Tesseract Works (And Why It’s in Your Browser)

Tesseract is compiled to WebAssembly for use in browsers via Tesseract.js. This is what powers our Scan PDF (OCR) tool.

The recognition pipeline:

1. Preprocessing. The page image is binarized (converted to pure black and white) using an adaptive thresholding algorithm. This separates text pixels from background. Noise is reduced with dilation/erosion operations.
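
The thresholding step can be illustrated with Otsu's method, which picks the gray level that best separates dark (text) pixels from light (background) pixels by maximizing between-class variance. This is a minimal sketch over a flat grayscale array; `otsuThreshold` and `binarize` are illustrative names, not Tesseract's API.

```javascript
// Otsu's method: choose the threshold that best separates foreground
// (text) from background pixels in a grayscale image (values 0-255).
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const p of pixels) hist[p]++;
  const total = pixels.length;

  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumBg = 0, weightBg = 0, bestVar = -1, bestT = 0;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;
    const weightFg = total - weightBg;
    if (weightFg === 0) break;
    sumBg += t * hist[t];
    const meanBg = sumBg / weightBg;
    const meanFg = (sumAll - sumBg) / weightFg;
    // Between-class variance: large when the two clusters are well separated.
    const betweenVar = weightBg * weightFg * (meanBg - meanFg) ** 2;
    if (betweenVar > bestVar) { bestVar = betweenVar; bestT = t; }
  }
  return bestT;
}

// Binarize: pixels at or below the threshold become text (0), the rest background (255).
function binarize(pixels) {
  const t = otsuThreshold(pixels);
  return pixels.map((p) => (p <= t ? 0 : 255));
}

// Dark ink (~20) on a light page (~230): the threshold lands between the clusters.
const page = [20, 25, 22, 230, 228, 235, 18, 232, 240, 21];
console.log(otsuThreshold(page)); // 25 (between the two clusters)
```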

2. Layout analysis. The engine segments the page into regions: columns, paragraphs, text lines, and individual words. This step — called “page segmentation” — determines reading order and identifies which regions contain text vs. images vs. tables.

3. Line normalization. Each text line is straightened (deskewed) and normalized to a standard x-height. Slight rotation from scanning is corrected here.

4. Character recognition. The LSTM processes each text line as a sequence, predicting the most likely character sequence. Unlike the legacy Tesseract engine, which classified characters one at a time, the LSTM treats the line holistically — it uses context to distinguish, for example, ‘rn’ from ‘m’ in ambiguous cases.

5. Dictionary and language model. A language model scores candidate character sequences. “the” is more likely than “tbe” even if the visual evidence is ambiguous. This is why OCR accuracy varies by language — languages with richer Tesseract training data get better results.
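
The "the" vs. "tbe" effect can be demonstrated with a toy model that combines visual confidence with a word-frequency prior in log space. The frequencies and the `scoreCandidate` function here are invented for illustration; they are not Tesseract's actual language model.

```javascript
// Toy language model: score OCR candidates by visual confidence
// weighted by word frequency, so "the" beats "tbe" in ambiguous cases.
const wordFreq = new Map([
  ["the", 0.05],  // very common English word
  ["tbe", 1e-9],  // essentially never occurs
]);

function scoreCandidate(word, visualConfidence) {
  const freq = wordFreq.get(word) ?? 1e-12; // unseen words get a floor probability
  // Log-space combination of visual evidence and the language prior.
  return Math.log(visualConfidence) + Math.log(freq);
}

// The image evidence slightly favors "tbe", but the prior overwhelms it.
const candidates = [
  { word: "tbe", visual: 0.55 },
  { word: "the", visual: 0.45 },
];
const best = candidates.reduce((a, b) =>
  scoreCandidate(a.word, a.visual) >= scoreCandidate(b.word, b.visual) ? a : b
);
console.log(best.word); // "the"
```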

6. Output. The recognized text is placed in a PDF text layer with precise coordinates mapping each word back to its position on the page. This is what makes text selection work accurately — clicking on a word selects the corresponding text layer element.
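
The coordinate mapping can be sketched as follows, assuming a pdf-lib-style page object with a `drawText(text, options)` method and Tesseract-style word boxes in image pixels. Each box is converted from pixel space (origin top-left) to PDF points (72 per inch, origin bottom-left) and drawn fully transparent. `addTextLayer` is an illustrative helper, not part of either library.

```javascript
// Place recognized words as an invisible text layer.
// bbox: { x0, y0, x1, y1 } in image pixels, origin at top-left (Tesseract style).
// PDF points: 72 per inch, origin at bottom-left.
function addTextLayer(page, words, dpi, pageHeightPx) {
  const scale = 72 / dpi; // pixels -> points
  for (const { text, bbox } of words) {
    const x = bbox.x0 * scale;
    const y = (pageHeightPx - bbox.y1) * scale; // flip the y axis
    const size = (bbox.y1 - bbox.y0) * scale;   // approximate font size from box height
    page.drawText(text, { x, y, size, opacity: 0 }); // invisible but selectable
  }
}

// Mock page to show the resulting coordinates for a 300 DPI scan of a letter page.
const calls = [];
const mockPage = { drawText: (text, opts) => calls.push({ text, ...opts }) };
addTextLayer(
  mockPage,
  [{ text: "Invoice", bbox: { x0: 300, y0: 150, x1: 900, y1: 250 } }],
  300,
  3300 // 11 in * 300 DPI
);
console.log(calls[0]); // { text: 'Invoice', x: 72, y: 732, size: 24, opacity: 0 }
```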

Factors That Most Affect OCR Accuracy

Scan resolution (DPI)

Resolution is the single biggest factor in OCR accuracy:

DPI        Typical OCR accuracy   Notes
72–96      60–75%                 Screen captures; many errors
150        85–92%                 Marginal; acceptable for rough drafts
200–300    95–99%                 Standard for document archiving
400–600    99%+                   Archives, legal, medical records

For production-quality results, scan at 300 DPI minimum. Most office scanners default to 200 DPI — this is adequate but not ideal. If you’re re-scanning something important, use 300 or 400 DPI.
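
The arithmetic behind these numbers is simple: a glyph's height in pixels is its point size times the DPI, divided by 72 (points per inch). A quick sketch:

```javascript
// Height in pixels of text of a given point size at a given scan resolution.
// 1 point = 1/72 inch, so pixels = points * dpi / 72.
function glyphHeightPx(pointSize, dpi) {
  return (pointSize * dpi) / 72;
}

// A 10pt character at common scan resolutions:
console.log(glyphHeightPx(10, 96));  // ~13 px - too coarse for fine strokes
console.log(glyphHeightPx(10, 150)); // ~21 px - marginal
console.log(glyphHeightPx(10, 300)); // ~42 px - comfortable for OCR
```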

Document condition

Print quality: Sharp laser prints OCR nearly perfectly. Faded dot-matrix or thermal paper is harder.

Paper quality: Yellowed or discolored paper reduces contrast, making the binarization step harder.

Ink bleed or spread: Inkjet prints that have blurred from moisture show letter defects that OCR struggles with.

Background texture: Linen-textured paper, graph paper, or patterned backgrounds confuse the binarization step.

Scan quality

Skew: Pages scanned more than 5–10 degrees off-axis show significantly higher error rates. Most scanners straighten pages automatically. Smartphone camera scans often have skew issues.

Glare: Reflective documents (glossy photos, thermal paper) can have bright spots that obscure characters.

Shadows: Document folds create shadows. Flatbed scanners minimize this; smartphone captures struggle.

Contrast: Low contrast between text and background (light gray ink on white paper, for instance) degrades accuracy.

Font and typography

Standard serif and sans-serif fonts: Nearly perfect recognition.

Decorative, handwritten, or calligraphic fonts: Significantly lower accuracy. OCR engines are trained primarily on printed text; handwriting is a distinct problem that requires handwriting-specific recognition models.

Small text: Below 8pt, even a 300 DPI scan leaves too few pixels per character for reliable recognition.

Mathematical notation and chemical formulas: Standard OCR handles running text. Equation-heavy documents need specialized math OCR (like Mathpix).

Languages and Multi-Language Documents

Tesseract supports 100+ languages with separate trained models for each. Quality varies significantly:

Well-supported (99%+ accuracy on clean documents):

  • English, German, French, Spanish, Italian, Dutch
  • Simplified and Traditional Chinese
  • Japanese, Korean
  • Arabic, Hebrew

Good support (95–98%):

  • Tamil, Hindi, Telugu, Malayalam, Kannada
  • Thai, Vietnamese
  • Polish, Czech, Russian

Limited support: Many minority languages have smaller training datasets and lower accuracy.

Multi-language documents (e.g., a Tamil-English contract) require either running OCR twice with different language models or loading a combined multi-language model, which is slower per page but needs only a single pass.
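
With Tesseract.js, a combined model is requested by joining language codes with `+` (e.g. `'eng+tam'`). The helper below just builds that spec string; the commented-out call shows roughly where it would be passed (the surrounding variable names are illustrative).

```javascript
// Build a Tesseract language spec from a list of language codes.
// Tesseract treats "eng+tam" as "use the English and Tamil models together".
function langSpec(codes) {
  return codes.join("+");
}

console.log(langSpec(["eng", "tam"])); // "eng+tam"

// Roughly how it would be used with Tesseract.js (not run here):
// const { data } = await Tesseract.recognize(pageImage, langSpec(["eng", "tam"]));
// console.log(data.text);
```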

Our Scan PDF tool currently defaults to English. For Tamil and Hindi documents, selecting the appropriate language model significantly improves results.

What OCR Cannot Do

Handwriting recognition

Tesseract’s OCR model is trained on printed text. Cursive handwriting, printing in non-standard letterforms, and signatures are outside its training distribution. Accuracy on handwritten documents is typically 40–70% — useful for rough extraction, not for authoritative text capture.

Handwriting-specific recognition requires different models (Google’s Cloud Vision, for example, has a handwriting mode).

Tables with complex structure

OCR extracts text character sequences but doesn’t inherently understand table structure. A two-column table may be read as one long column, or alternating cells may be interleaved incorrectly. Post-processing tools (like Camelot for Python) specialize in table extraction from PDFs.

Complex mathematical formulas

Inline math in running text is recognized reasonably well if the symbols are common (×, ÷, ±, π). Complex display-mode equations with fractions, integrals, and stacked expressions are typically mangled. Mathpix and similar specialized tools handle this.

Dark or inverted text

White text on black backgrounds is typically not handled by standard binarization. A preprocessing step must invert the image before OCR.
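
The inversion itself is a one-liner over the grayscale values; a minimal sketch operating on a flat pixel array:

```javascript
// Invert grayscale pixels so white-on-black text becomes black-on-white
// before handing the image to the OCR engine.
function invert(pixels) {
  return pixels.map((p) => 255 - p);
}

// White text (255) on a black background (0) becomes dark text on white.
console.log(invert([255, 0, 200, 30])); // [ 0, 255, 55, 225 ]
```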

Very small text

Below 8–10pt at 300 DPI, individual character pixels are too few for reliable recognition. Scanning at higher DPI helps here.

How to Get the Best Results From Browser OCR

Before scanning

  • Use a flatbed scanner at 300 DPI (not a phone camera if you can help it)
  • Clean the scanner glass
  • Ensure pages lie flat (no curl at edges)
  • Use black-and-white or grayscale mode (color scanning is 3× the data with no accuracy benefit for text)

Improving phone camera scans

  • Use the document scan mode in your phone’s camera app (it applies perspective correction)
  • Ensure strong, even lighting from above (avoid shadows)
  • Capture from directly above — not at an angle
  • Microsoft Office Lens and Adobe Scan preprocess images specifically for OCR

Processing in EssentialPDFTool

  1. Open Scan PDF (OCR)
  2. Upload your scanned PDF
  3. Select language (match to the primary language of the document)
  4. Click Start OCR — this runs entirely in your browser using Tesseract.js
  5. Download the searchable PDF

OCR processing takes 5–30 seconds per page depending on image resolution and your device’s processing speed. A 20-page scanned document may take 3–5 minutes on an average laptop.

After OCR

Verify the output. Use Ctrl+F to search for key terms you know are in the document. Check recognition accuracy by selecting text in a passage and comparing to the visible image.

Compress the result. OCR adds a text layer but doesn’t remove the image data. Your searchable PDF will be approximately the same size as the original. Compress PDF can reduce it significantly.

OCR for Accessibility

Converting scanned PDFs to searchable format isn’t just about convenience — it’s an accessibility requirement. Screen readers used by blind and visually-impaired users cannot process image-only PDFs. After OCR:

  • Screen readers can read the document aloud
  • Keyboard navigation through text works
  • Text-to-braille translation is possible
  • Automated accessibility tools can audit the content

For fully accessible PDFs, OCR is the first step. Tagging the document structure (marking headings, lists, and tables) is the second. PDF/A-1a adds that structural layer. Our current OCR tool creates a basic searchable PDF; advanced accessibility tagging requires additional post-processing tools.

The Privacy Difference in Browser OCR

Most OCR services work by uploading your document to a cloud service that runs recognition on powerful servers. For confidential documents — legal records, medical files, financial statements — this creates obvious privacy concerns.

Tesseract.js runs the entire OCR pipeline inside your browser tab. Your document pages are rendered to Canvas elements in your browser’s memory, processed by a locally-executed WebAssembly module, and the text is written back into a new PDF — all without a single byte of your document content reaching a server.

This matters particularly for documents you’d hesitate to email to a stranger. Medical records, attorney-client communications, and personal identification documents can all be made searchable without leaving your device.

OCR technology that once required specialized server hardware now fits in a browser tab. That’s what makes genuinely private document processing possible.
