wcag21aa.org

Scanned vs. born-digital PDFs

The most important structural distinction in a PDF estate. Scanned and born-digital PDFs look the same on screen but require fundamentally different remediation work.

By Levi Whitted

Why the distinction matters

Two PDFs that look identical on screen can require vastly different remediation work. A born-digital PDF may need only tag corrections to reach WCAG 2.1 AA conformance. A scanned image PDF needs optical character recognition (OCR) before any structural work can begin, and then needs the same tagging work on top.

The cost difference is not marginal. Per-page remediation cost for a scanned PDF is typically two to four times the cost for a born-digital PDF of the same length, depending on document complexity and tool choice. For estate-level planning, the split between scanned and born-digital is the single biggest driver of total budget.
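To see how the split drives the total, the following is a rough Python sketch. The base cost per page and the page counts are hypothetical placeholders; only the two-to-four-times multiplier comes from the paragraph above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Estate:
    born_digital_pages: int
    scanned_pages: int


def budget_range(
    estate: Estate,
    base_cost_per_page: float,
    scanned_multiplier: Tuple[float, float] = (2.0, 4.0),
) -> Tuple[float, float]:
    """Return a (low, high) total cost for the estate.

    base_cost_per_page is an assumed born-digital cost; scanned pages
    are priced at 2x to 4x that figure, per the range quoted above.
    """
    low_mult, high_mult = scanned_multiplier
    born_digital = estate.born_digital_pages * base_cost_per_page
    low = born_digital + estate.scanned_pages * base_cost_per_page * low_mult
    high = born_digital + estate.scanned_pages * base_cost_per_page * high_mult
    return low, high


# Hypothetical estate: 10,000 born-digital pages, 5,000 scanned pages,
# at an assumed base cost of 4.00 per born-digital page.
low, high = budget_range(Estate(10_000, 5_000), base_cost_per_page=4.0)
print(f"Estimated budget: {low:,.0f} to {high:,.0f}")
```

In this made-up estate the scanned pages are one third of the volume but account for half to two thirds of the estimated cost, which is why the split is worth measuring before budgeting.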

Scanned image PDFs

A scanned PDF is essentially a photograph (or set of photographs) of paper, embedded in PDF format. The pages contain image data, not text. Screen readers encounter nothing readable; selection tools cannot select text; search returns no results. From an accessibility standpoint, a scanned PDF without OCR is functionally inaccessible to anyone using assistive technology.

Scanned PDFs are common in public entities for predictable reasons:

  • Historical board minutes, agendas, and reports from before electronic authoring workflows
  • Records produced by capture-and-archive workflows (scan paper, post as PDF)
  • Documents received from outside parties (vendor reports, legal filings, third-party correspondence) and posted as received
  • Records pulled from physical archives in response to public records requests, then retained online

Remediation requires, at minimum:

  1. OCR (optical character recognition) to extract a text layer from the page images
  2. Manual review of the OCR output to correct recognition errors, especially in tables, multi-column layouts, handwriting, and degraded source material
  3. Structural tagging to add reading order, headings, list structure, and table semantics
  4. Alt text on meaningful images embedded within the scan
  5. Verification with a screen reader and an automated checker
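Step 1 is the only step in this list that is routinely automated. As an illustration, the sketch below builds a command line for OCRmyPDF, an open-source OCR tool. The tool choice is an assumption made for this example, not something this article prescribes, and the file names are hypothetical. Steps 2 through 5 still need a person.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> List[str]:
    """Build an OCRmyPDF command that adds a text layer to a scanned PDF."""
    return [
        "ocrmypdf",
        "--language", language,  # OCR language (Tesseract language code)
        "--deskew",              # straighten crooked page scans before OCR
        "--skip-text",           # leave pages that already have text untouched
        str(src),
        str(dst),
    ]


def run_ocr(src: Path, dst: Path, language: str = "eng") -> None:
    """Run OCR, failing early if the tool is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src, dst, language), check=True)


# Hypothetical file names, shown for illustration only.
print(" ".join(build_ocr_command(Path("minutes-scan.pdf"), Path("minutes-ocr.pdf"))))
```

The output of a run like this is a searchable PDF, not an accessible one: it has a text layer but no tags, reading order, or alt text.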

For long scanned documents with tables, multi-column layouts, or complex formatting, this is substantial labor. A 200-page scanned legacy report can take a trained remediator 8 to 20 hours to bring to WCAG 2.1 AA conformance, depending on source quality.

Born-digital PDFs

A born-digital PDF was created electronically. It was exported from Word, generated by a financial system, produced by a CMS, or assembled by a design tool. The pages contain actual text characters; the document has a text layer; search and copy-paste work.

A born-digital PDF still may or may not be accessible. It depends on whether the authoring workflow produced structural tagging. Common scenarios:

  • Tagged at authoring. Modern Word with appropriate styles, exported via "Save As PDF" with tagging enabled (in Word for Windows, the "Document structure tags for accessibility" option), can produce a tagged PDF that needs only minor cleanup. This is the cheapest case.
  • Untagged but flat-content. A short born-digital PDF (a one-page flyer, a simple letter) may have no tags, but it has so little structure that tagging it is fast.
  • Untagged with complex structure. A long born-digital PDF with headings, lists, tables, and multi-column layouts that was exported without tagging needs full structural tagging work. Less expensive than scanned, more expensive than the simple cases.
  • Tagged but incorrectly. A PDF generated by a system that exports tags by default but assigns them poorly (everything tagged as <P>, including headings and list items; heading levels applied inconsistently; tables flattened) can be harder to remediate than an untagged document, because the bad tags have to be cleaned up before correct tags can be applied.
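A first-pass sort of born-digital files into "tagged" and "untagged" can be sketched in Python. A tagged PDF declares a structure tree (`/StructTreeRoot`) and a `/MarkInfo` dictionary with `/Marked true` in its document catalog; the sketch below simply looks for those markers in the raw bytes. This is a heuristic under stated assumptions: the markers can sit inside compressed object streams, where a byte search will miss them, and finding them says nothing about whether the tags are correct.

```python
import re


def tagging_status(pdf_bytes: bytes) -> str:
    """Classify a PDF by the tagging markers visible in its raw bytes."""
    # A tagged PDF's catalog points to a structure tree ...
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    # ... and declares itself tagged via /MarkInfo << /Marked true >>.
    is_marked = re.search(rb"/Marked\s*true", pdf_bytes) is not None

    if has_struct_tree and is_marked:
        return "tagged"
    if has_struct_tree or is_marked:
        return "partially marked"
    return "no tag markers found"


# Minimal synthetic catalog fragments, not complete PDF files.
tagged = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
untagged = b"<< /Type /Catalog /Pages 2 0 R >>"
print(tagging_status(tagged))
print(tagging_status(untagged))
```

A result of "tagged" only narrows the file to the first or last scenario above; telling a cleanly tagged document from a badly tagged one still takes a manual look at the tag tree.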

Different remediation paths

The two document types call for different tooling and workflow.

| Step | Scanned | Born-digital |
| --- | --- | --- |
| OCR | Required, with manual review | Not needed (text layer exists) |
| Reading order | Often broken by OCR and must be set manually | Usually correct from authoring, may need verification |
| Tagging | Full tagging from scratch | Tag correction or full tagging depending on source |
| Tables | OCR rarely captures table structure; manual rebuild typical | May have table tags from authoring; verify cells and headers |
| Alt text on embedded images | Common; images within the scan must be identified and described | Common; same review needed |
| Typical per-page time | 20 to 90 minutes depending on complexity | 5 to 30 minutes depending on starting state |
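The per-page times in the last row can be turned into a rough labor range for a single document. The sketch below is plain arithmetic on those figures; the 12-page document is a hypothetical example.

```python
from typing import Dict, Tuple

# Per-page remediation time in minutes (low, high), from the table above.
PER_PAGE_MINUTES: Dict[str, Tuple[int, int]] = {
    "scanned": (20, 90),
    "born-digital": (5, 30),
}


def hours_range(kind: str, pages: int) -> Tuple[float, float]:
    """Return (low, high) estimated remediation hours for one document."""
    low, high = PER_PAGE_MINUTES[kind]
    return pages * low / 60, pages * high / 60


# Hypothetical 12-page document, estimated both ways.
for kind in PER_PAGE_MINUTES:
    low, high = hours_range(kind, 12)
    print(f"{kind}: {low:.1f} to {high:.1f} hours")
```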

For born-digital documents that go through a recurring authoring workflow (board agendas exported each meeting, schedules republished each term), the highest leverage is upstream: fix the authoring template and export settings so future documents are accessible from creation. Retroactive remediation of past documents is then a finite catch-up project rather than ongoing labor.

For scanned documents, the question of remediate-vs-replace is sharper. Many scanned documents would be cheaper to retype as HTML pages than to remediate. See HTML vs. remediation.

How to tell which one you have

Three quick tests help distinguish a scanned PDF from a born-digital one. The first two detect whether a text layer exists; the third is a rough signal only.

1. Try to select text

Open the PDF in any viewer and try to highlight a passage of text. If the highlight selects character-by-character (you can copy individual words), the PDF is born-digital or scanned-then-OCR'd. If nothing highlights, or the viewer can select only a rectangular region of the page image, the PDF is scanned and has no text layer.

2. Search for a word

Use the PDF viewer's find function to search for a word you can see on the page. A born-digital or OCR'd PDF returns a match. A pure scanned PDF returns nothing.
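Tests 1 and 2 both ask the same question: does the page have a text layer? Across a large estate that question can be asked in bulk. The Python sketch below assumes the text of each page has already been extracted into a list of strings (for example with pypdf's `PdfReader` and `page.extract_text()`, which is one option and is not imported here). The 20-character threshold and the status labels are assumptions chosen for illustration.

```python
from typing import List


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is near-empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def text_layer_status(page_texts: List[str], min_chars: int = 20) -> str:
    """Summarize whether a document appears to have a text layer."""
    if not page_texts:
        return "no pages"
    missing = pages_without_text(page_texts, min_chars)
    if len(missing) == len(page_texts):
        return "no text layer"       # behaves like a pure scan
    if missing:
        return "mixed"               # some pages have text, some do not
    return "text layer present"      # born-digital or scanned-then-OCR'd


# Hypothetical extraction results for a two-page file.
print(text_layer_status(["Minutes of the regular board meeting", ""]))
```

A "mixed" result deserves a manual look: it may be a born-digital document with scanned attachments, or simply one with an intentionally blank page.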

3. Check file size relative to length

Scanned PDFs are typically much larger per page than born-digital ones, because they contain image data rather than text. A 20-page born-digital PDF is usually a few hundred KB. A 20-page scanned PDF can easily be 5 to 20 MB.
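The size test reduces to kilobytes per page. The Python sketch below uses a cutoff of 150 KB per page, which is an assumption picked to fall between the two examples above (a few hundred KB versus 5 to 20 MB for 20 pages), not a standard value. Image-heavy born-digital files can exceed it, so treat the result as a hint to confirm with the first two tests.

```python
def kb_per_page(file_size_bytes: int, page_count: int) -> float:
    """Average file size per page, in kilobytes."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / 1024 / page_count


def looks_scanned_by_size(
    file_size_bytes: int, page_count: int, threshold_kb: float = 150.0
) -> bool:
    """True when the file is large enough per page to suggest page images."""
    return kb_per_page(file_size_bytes, page_count) > threshold_kb


# A 20-page file of 400 KB versus a 20-page file of 10 MB.
print(looks_scanned_by_size(400 * 1024, 20))
print(looks_scanned_by_size(10 * 1024 * 1024, 20))
```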