Scanned vs. born-digital PDFs
Whether a PDF is scanned or born-digital is the most important structural distinction in a PDF estate. The two kinds look the same on screen but require fundamentally different remediation work.
Why the distinction matters
Two PDFs that look identical on screen can require vastly different remediation work. A born-digital PDF may need only tag corrections to reach WCAG 2.1 AA conformance. A scanned image PDF needs optical character recognition before any structural work can begin, and then needs the same tagging work on top.
The cost difference is not marginal. Per-page remediation cost for a scanned PDF is typically two to four times the cost for a born-digital PDF of the same length, depending on document complexity and tool choice. For estate-level planning, the split between scanned and born-digital is the single biggest driver of total budget.
Scanned image PDFs
A scanned PDF is essentially a photograph (or set of photographs) of paper, embedded in PDF format. The pages contain image data, not text. Screen readers encounter nothing readable; selection tools cannot select text; search returns no results. From an accessibility standpoint, a scanned PDF without OCR is functionally inaccessible to anyone using assistive technology.
Scanned PDFs are common in public entities for predictable reasons:
- Historical board minutes, agendas, and reports from before electronic authoring workflows
- Records produced by capture-and-archive workflows (scan paper, post as PDF)
- Documents received from outside parties (vendor reports, legal filings, third-party correspondence) and posted as received
- Records pulled from physical archives in response to public records requests, then retained online
Remediation requires, at minimum:
- OCR (optical character recognition) to extract a text layer from the page images
- Manual review of the OCR output to correct recognition errors, especially in tables, multi-column layouts, handwriting, and degraded source material
- Structural tagging to add reading order, headings, list structure, and table semantics
- Alt text on meaningful images embedded within the scan
- Verification with a screen reader and an automated checker
For long scanned documents with tables, multi-column layouts, or complex formatting, this is substantial labor. A 200-page scanned legacy report can take a trained remediator 8 to 20 hours to bring to WCAG 2.1 AA conformance, depending on source quality.
Born-digital PDFs
A born-digital PDF was created electronically. It was exported from Word, generated by a financial system, produced by a CMS, or assembled by a design tool. The pages contain actual text characters; the document has a text layer; search and copy-paste work.
A born-digital PDF is not necessarily accessible. That depends on whether the authoring workflow produced structural tagging. Common scenarios:
- Tagged at authoring. Modern Word with appropriate styles, exported via "Save As PDF" with the right settings, can produce a tagged PDF that needs only minor cleanup. This is the cheapest case.
- Untagged but flat-content. A short born-digital PDF (a one-page flyer, a simple letter) may have no tags but also very little structure to tag. Remediation is fast.
- Untagged with complex structure. A long born-digital PDF with headings, lists, tables, and multi-column layouts that was exported without tagging needs full structural tagging work. Less expensive than scanned, more expensive than the simple cases.
- Tagged but incorrectly. A PDF generated by a system that exports tags by default but assigns them poorly (every element tagged as <P>, headings tagged inconsistently, tables flattened) can be harder to remediate than an untagged document, because the bad tags have to be cleaned up before correct tags can be applied.
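The first split in the scenarios above, tagged versus untagged, can be checked mechanically. In the PDF format, a tagged document's catalog carries a /StructTreeRoot entry and a /MarkInfo dictionary whose /Marked entry is true. The sketch below is a minimal illustration that assumes the catalog has already been read into a plain Python dictionary by whatever PDF library is in use (that loading step is not shown). The function names and the `has_complex_structure` flag are hypothetical.

```python
from typing import Any, Mapping


def declares_tagging(catalog: Mapping[str, Any]) -> bool:
    """Return True when the catalog claims the document is a tagged PDF.

    Assumes `catalog` is a plain dict view of the PDF document catalog.
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return "/StructTreeRoot" in catalog and bool(mark_info.get("/Marked", False))


def born_digital_scenario(catalog: Mapping[str, Any], has_complex_structure: bool) -> str:
    """Map a born-digital PDF onto one of the scenarios listed above.

    `has_complex_structure` is a human judgment (headings, lists, tables,
    columns); this sketch does not try to detect it.
    """
    if declares_tagging(catalog):
        # The flag only says tags exist, not that they are correct.
        return "tagged: audit tag quality before trusting it"
    if has_complex_structure:
        return "untagged, complex structure: full tagging needed"
    return "untagged, flat content: quick tagging"


tagged = {"/StructTreeRoot": {}, "/MarkInfo": {"/Marked": True}}
print(born_digital_scenario(tagged, has_complex_structure=True))
print(born_digital_scenario({}, has_complex_structure=False))
```

A positive result cannot separate "tagged at authoring" from "tagged but incorrectly". That still takes a human review of the tag tree.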
Different remediation paths
The two document types call for different tooling and workflow.
| Step | Scanned | Born-digital |
|---|---|---|
| OCR | Required, with manual review | Not needed (text layer exists) |
| Reading order | Often broken by OCR and must be set manually | Usually correct from authoring, may need verification |
| Tagging | Full tagging from scratch | Tag correction or full tagging depending on source |
| Tables | OCR rarely captures table structure; manual rebuild typical | May have table tags from authoring; verify cells and headers |
| Alt text on embedded images | Common; images within the scan must be identified and described | Common; same review needed |
| Typical per-page time | 20 to 90 minutes depending on complexity | 5 to 30 minutes depending on starting state |
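The per-page ranges in the last row can be turned into a rough labor estimate for an estate. The sketch below only multiplies page counts by those ranges. The page counts in the example are hypothetical, and the function name `estimate_hours` is illustrative.

```python
# Per-page minute ranges (low, high), taken from the table's last row.
MINUTES_PER_PAGE = {
    "scanned": (20, 90),
    "born_digital": (5, 30),
}


def estimate_hours(page_counts):
    """Return (low, high) total remediation hours for an estate.

    `page_counts` maps "scanned" / "born_digital" to a number of pages.
    """
    low_minutes = 0
    high_minutes = 0
    for kind, pages in page_counts.items():
        low, high = MINUTES_PER_PAGE[kind]
        low_minutes += pages * low
        high_minutes += pages * high
    return low_minutes / 60, high_minutes / 60


# Hypothetical estate: 3,000 scanned pages and 12,000 born-digital pages.
low, high = estimate_hours({"scanned": 3_000, "born_digital": 12_000})
print(f"{low:,.0f} to {high:,.0f} hours")  # prints "2,000 to 10,500 hours"
```

In this made-up estate the scanned pages are one fifth of the total but account for half of the low-end hours, which is why the scanned share drives the budget.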
For born-digital documents that go through a recurring authoring workflow (board agendas exported each meeting, schedules republished each term), the highest leverage is upstream: fix the authoring template and export settings so future documents are accessible from creation. Retroactive remediation of past documents is then a finite catch-up project rather than ongoing labor.
For scanned documents, the question of remediate-vs-replace is sharper. Many scanned documents would be cheaper to retype as HTML pages than to remediate. See HTML vs. remediation.
How to tell which one you have
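Three quick tests help distinguish a scanned PDF from a born-digital one. The first two only reveal whether a text layer exists, so an OCR'd scan passes them just as a born-digital file does: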
1. Try to select text
Open the PDF in any viewer and try to highlight a passage of text. If the highlight selects character-by-character (you can copy individual words), the PDF is born-digital or scanned-then-OCR'd. If the highlight selects rectangular regions of the page image, the PDF is scanned and has no text layer.
2. Search for a word
Use the PDF viewer's find function to search for a word you can see on the page. A born-digital or OCR'd PDF returns a match. A pure scanned PDF returns nothing.
3. Check file size relative to length
Scanned PDFs are typically much larger per page than born-digital ones, because they contain image data rather than text. A 20-page born-digital PDF is usually a few hundred KB. A 20-page scanned PDF can easily be 5 to 20 MB.
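When an estate is too large to open file by file, the three tests can be approximated in a script. The sketch below assumes each page's text has already been extracted by a PDF library and that the file size is known. The name `classify_pdf`, the `Verdict` type, and both thresholds are hypothetical. The size cutoff was chosen only because it sits between the born-digital and scanned figures quoted above, and it should be tuned on a sample of real files.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
MIN_CHARS_PER_PAGE = 25         # below this, a page counts as having no text
SCAN_BYTES_PER_PAGE = 100_000   # image-based pages usually sit well above this


@dataclass
class Verdict:
    pages: int
    pages_with_text: int
    bytes_per_page: float
    label: str


def classify_pdf(page_texts, file_size_bytes):
    """Apply the text-layer and size-per-page tests to one document.

    `page_texts` holds one extracted string (or None) per page.
    """
    pages = len(page_texts)
    if pages == 0:
        raise ValueError("document has no pages")

    # Tests 1 and 2: is there selectable, searchable text on each page?
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= MIN_CHARS_PER_PAGE
    )
    # Test 3: file size relative to length.
    bytes_per_page = file_size_bytes / pages

    if pages_with_text == 0:
        label = "scanned, no text layer"
    elif pages_with_text < pages:
        label = "mixed: some pages lack a text layer"
    elif bytes_per_page >= SCAN_BYTES_PER_PAGE:
        label = "text layer present; size suggests an OCR'd scan"
    else:
        label = "likely born-digital"
    return Verdict(pages, pages_with_text, bytes_per_page, label)


sample = "Board agenda for the regular meeting of the council."
print(classify_pdf([sample] * 20, 300_000).label)     # likely born-digital
print(classify_pdf([""] * 20, 10_000_000).label)      # scanned, no text layer
print(classify_pdf([sample] * 20, 10_000_000).label)  # text layer present; size suggests an OCR'd scan
```

The page strings can come from any extraction tool. With the third-party pypdf library, for instance, calling `page.extract_text()` on each page in `PdfReader(path).pages` yields one string per page. Treat the label as a triage hint rather than a verdict: blank pages, image-heavy born-digital reports, and heavily compressed scans can all land in the wrong bucket.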