What Is Backfile Conversion?
A practical overview for county offices planning or evaluating a backfile project.
Definition
Backfile conversion is the process of taking previously recorded documents — often stored as unindexed scans, microfilm, or paper — and converting them into searchable, indexed digital records. The goal is to make historical records as findable and usable as newly recorded documents.
A backfile project typically involves several stages: scanning (if records aren't already digital), image cleanup, OCR, metadata extraction, quality review, and import into a records management or land records system.
Why public offices do it
Most county offices have years or decades of records that were scanned but never indexed with structured metadata. These documents exist as image files — technically digital, but practically unfindable without knowing exactly where to look.
Backfile conversion makes those records searchable. Once indexed, staff can find documents by name, date, document type, parcel number, or instrument number instead of browsing folder structures or relying on institutional knowledge.
Common triggers include:
- Migrating to a new document management or land records system
- Responding to audit findings about records accessibility
- Reducing the time it takes to fulfill public records requests
- Retiring legacy systems that are no longer supported
- Consolidating records from multiple departments or offices
Backfile vs. day-forward processing
Day-forward processing is the ongoing work of indexing new documents as they come in — each deed, mortgage, or lien gets indexed as it's recorded. Most offices already have a day-forward workflow, even if it's manual.
Backfile conversion is the catch-up work: going back through months, years, or decades of previously recorded documents that were stored but never properly indexed. The two workflows often run in parallel — day-forward continues while backfile progresses in batches.
The practical challenge is that backfile work usually cannot be allowed to disrupt day-forward operations. Staff capacity, system access, and QC bandwidth all need to accommodate both workflows. Projects that plan for this from the start tend to go more smoothly.
What's involved
Scanning and image preparation
If records are still on paper or microfilm, the first step is scanning. For OCR to work reliably, scans should be at least 300 DPI, properly oriented, and free of heavy skew or noise. Image preparation — deskewing, cropping, despeckling — improves downstream extraction quality.
If records were scanned previously, the quality of those scans determines whether re-scanning is necessary. Low-resolution or poorly captured images often produce unreliable OCR output.
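One way to screen existing scans against the 300 DPI guideline is to derive the effective resolution from pixel dimensions. The sketch below assumes letter-size originals (8.5 x 11 inches) and uses hypothetical function names; offices with ledger books or odd-sized instruments would need to supply the real page size.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_width_in: float = 8.5,
                  page_height_in: float = 11.0) -> float:
    """Return the lower of the horizontal and vertical resolution.

    Assumes the image covers the full page; the default page size
    (US letter) is an illustrative assumption.
    """
    return min(width_px / page_width_in, height_px / page_height_in)


def needs_rescan(width_px: int, height_px: int, min_dpi: float = 300.0) -> bool:
    """True when the scan falls below the minimum resolution for OCR."""
    return effective_dpi(width_px, height_px) < min_dpi


print(needs_rescan(2550, 3300))  # letter page at 300 DPI -> False
print(needs_rescan(1700, 2200))  # letter page at 200 DPI -> True
```

Resolution is only one signal. A scan can meet 300 DPI and still be skewed or noisy, so a check like this works best as a first filter before visual sampling.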
OCR and text extraction
Optical character recognition converts scanned images into machine-readable text. Accuracy depends on scan quality, document age, font clarity, and whether the text is typed or handwritten. Modern OCR engines handle clean, typed documents well. Older documents with faded ink, stamps over text, or handwritten entries produce lower accuracy and require more human review.
For a deeper look at how OCR fits into AI-assisted workflows, see the AI document indexing guide.
Metadata extraction and indexing
Once OCR text is available, metadata extraction identifies the structured fields needed for indexing: document type, recording date, grantor and grantee names, legal descriptions, parcel numbers, instrument numbers, and more.
This can be done manually (staff reading and keying in fields), semi-automatically (software suggests values, staff confirm), or with AI-assisted tools that extract fields and flag exceptions for review. The right approach depends on volume, budget, and accuracy requirements.
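The semi-automatic approach can be illustrated with a small sketch: software suggests values from OCR text, and any field it cannot fill is left for staff to key. The patterns, document type list, and field names below are hypothetical; real instruments vary widely by county and era, and production tools typically rely on more than regular expressions.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only.
INSTRUMENT_RE = re.compile(
    r"Instrument\s+(?:No\.?|Number)\s*:?\s*(\d{4}-\d{3,})", re.IGNORECASE)
DATE_RE = re.compile(r"Recorded\s*:?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE)
DOC_TYPES = ("WARRANTY DEED", "QUITCLAIM DEED", "MORTGAGE", "LIEN")


def suggest_fields(ocr_text: str) -> dict:
    """Suggest index values from OCR text; None marks a field for manual keying."""
    upper = ocr_text.upper()
    doc_type = next((t for t in DOC_TYPES if t in upper), None)

    inst = INSTRUMENT_RE.search(ocr_text)
    found_date = DATE_RE.search(ocr_text)
    recording_date = None
    if found_date:
        try:
            parsed = datetime.strptime(found_date.group(1), "%m/%d/%Y")
            recording_date = parsed.date().isoformat()
        except ValueError:
            recording_date = None  # OCR produced an impossible date

    return {
        "document_type": doc_type,
        "instrument_number": inst.group(1) if inst else None,
        "recording_date": recording_date,
    }


sample = "QUITCLAIM DEED\nInstrument No. 1987-004512\nRecorded: 3/14/1987"
fields = suggest_fields(sample)
needs_review = [name for name, value in fields.items() if value is None]
print(fields)
print(needs_review)
```

Returning None instead of a best guess keeps uncertain fields visible to reviewers, which matters more as document quality drops.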
Quality control and exception review
No extraction method is perfect. Quality control involves reviewing extracted data against the source document, correcting errors, and handling edge cases — documents the system couldn't classify, fields it couldn't extract, or values that don't match expected patterns.
A well-designed QC workflow catches errors before data reaches the target system. The reindexing and QC guide covers this stage in detail.
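As a rough illustration of exception routing, the sketch below checks a record for missing fields and for values that fall outside expected patterns, returning the reasons it should go to manual review. The required fields, parcel number format, and date-range rule are all assumptions made for the example.

```python
import re
from datetime import date

PARCEL_RE = re.compile(r"^\d{2}-\d{3}-\d{3}$")  # hypothetical parcel format
REQUIRED = ("document_type", "recording_date", "grantor", "grantee")


def find_exceptions(record: dict, earliest: date, latest: date) -> list[str]:
    """Return reasons a record needs manual review (empty list means pass)."""
    reasons = []
    for field in REQUIRED:
        if not record.get(field):
            reasons.append(f"missing {field}")

    parcel = record.get("parcel_number")
    if parcel and not PARCEL_RE.match(parcel):
        reasons.append("parcel_number does not match expected pattern")

    raw_date = record.get("recording_date")
    if raw_date:
        try:
            recorded = date.fromisoformat(raw_date)
        except ValueError:
            reasons.append("recording_date is not a valid date")
        else:
            if not earliest <= recorded <= latest:
                reasons.append("recording_date outside batch date range")
    return reasons


batch_range = (date(1980, 1, 1), date(1989, 12, 31))
bad = {
    "document_type": "DEED",
    "recording_date": "1897-03-14",  # likely a transposed 1987
    "grantor": "",
    "grantee": "ROE, JANE",
    "parcel_number": "12-345-67",
}
print(find_exceptions(bad, *batch_range))
```

Checking dates against the batch's known range is a cheap way to catch transposed digits that a format check alone would accept.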
Import into the target system
The final step is loading indexed data into the destination system — a land records platform, a document management system, or a state portal. This requires mapping extracted fields to the target schema, validating data against business rules, and handling records that fail validation.
Import is often more complex than it appears. Field formats, naming conventions, and required fields can differ between systems. Testing with sample batches before a full import helps catch mapping errors early.
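A sample-batch dry run might look something like the following sketch. The column names, required fields, and length limit are invented for illustration; the real values come from the destination system's import specification.

```python
# Hypothetical mapping from extraction field names to target-system columns.
FIELD_MAP = {
    "document_type": "DOC_TYPE",
    "recording_date": "REC_DATE",
    "instrument_number": "INSTR_NO",
    "grantor": "GRANTOR_NAME",
    "grantee": "GRANTEE_NAME",
}
REQUIRED_TARGET = {"DOC_TYPE", "REC_DATE", "INSTR_NO"}
MAX_NAME_LEN = 60  # assumed column width in the target system


def to_target_row(record: dict) -> tuple[dict, list[str]]:
    """Map a record to the target schema and collect validation errors."""
    row = {target: (record.get(source) or "").strip()
           for source, target in FIELD_MAP.items()}
    errors = [f"{col} is required"
              for col in sorted(REQUIRED_TARGET) if not row[col]]
    for col in ("GRANTOR_NAME", "GRANTEE_NAME"):
        if len(row[col]) > MAX_NAME_LEN:
            errors.append(f"{col} exceeds {MAX_NAME_LEN} characters")
    return row, errors


def dry_run(sample_batch: list[dict]) -> dict:
    """Validate a sample batch without importing; summarize failures."""
    results = [to_target_row(r) for r in sample_batch]
    failed = [errs for _, errs in results if errs]
    return {"total": len(results), "failed": len(failed), "errors": failed}


sample_batch = [
    {"document_type": "DEED", "recording_date": "1987-03-14",
     "instrument_number": "1987-004512", "grantor": "DOE, JOHN",
     "grantee": "ROE, JANE"},
    {"document_type": "MORTGAGE", "recording_date": "1987-03-15",
     "instrument_number": None, "grantor": "SMITH, ANN",
     "grantee": "FIRST BANK"},
]
print(dry_run(sample_batch))
```

Running this kind of check on a few hundred records early in the project tends to surface mapping problems while they are still cheap to fix.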
Planning a backfile project
Before starting, offices should consider:
- Scope: Which document types and date ranges to include. Trying to do everything at once often leads to delays.
- Source material condition: The quality of existing scans or physical records directly affects OCR accuracy and project timeline.
- Index fields: Define exactly which metadata fields are needed for each document type before extraction begins.
- Acceptance criteria: What accuracy rate is acceptable? How will QC be measured? Define this upfront.
- Target system requirements: Understand the import format and validation rules of the destination system before starting extraction.
- Staffing: Determine whether the project will be handled in-house, by a vendor, or a combination.
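To make the acceptance criteria item concrete, one possible measurement is field-level accuracy on a QC sample: compare each extracted value with the value a reviewer verified against the source image. The thresholds and sample data below are hypothetical, and a real sample would be far larger than four records.

```python
def field_accuracy(sample: list[tuple[dict, dict]]) -> dict[str, float]:
    """Per-field share of sampled records where extracted == verified."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for extracted, verified in sample:
        for field, truth in verified.items():
            totals[field] = totals.get(field, 0) + 1
            if extracted.get(field) == truth:
                correct[field] = correct.get(field, 0) + 1
    return {f: correct.get(f, 0) / totals[f] for f in totals}


def meets_criteria(accuracy: dict[str, float],
                   thresholds: dict[str, float]) -> bool:
    """True only if every field reaches its agreed threshold."""
    return all(accuracy.get(f, 0.0) >= t for f, t in thresholds.items())


# Each pair is (extracted values, reviewer-verified values).
qc_sample = [
    ({"recording_date": "1987-03-14", "grantor": "DOE, JOHN"},
     {"recording_date": "1987-03-14", "grantor": "DOE, JOHN"}),
    ({"recording_date": "1987-03-15", "grantor": "SMITH, ANN"},
     {"recording_date": "1987-03-15", "grantor": "SMITH, ANN"}),
    ({"recording_date": "1987-03-16", "grantor": "J0NES, AL"},  # OCR read O as 0
     {"recording_date": "1987-03-16", "grantor": "JONES, AL"}),
    ({"recording_date": "1987-03-17", "grantor": "LEE, KIM"},
     {"recording_date": "1987-03-17", "grantor": "LEE, KIM"}),
]
accuracy = field_accuracy(qc_sample)
thresholds = {"recording_date": 0.99, "grantor": 0.95}  # hypothetical targets
print(accuracy)                              # {'recording_date': 1.0, 'grantor': 0.75}
print(meets_criteria(accuracy, thresholds))  # False
```

Measuring per field rather than per document shows which fields drive rework, since names and legal descriptions often lag behind dates and document types.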
Common risks
Backfile projects often take longer and cost more than initial estimates suggest. The most common reasons:
- Exception volume is higher than expected. Teams estimate based on clean documents but undercount the records that need manual review — older documents, poor scans, and unusual formats all increase exceptions.
- Target system import is treated as an afterthought. Starting extraction before defining the target schema and field mapping leads to rework. Test imports early, not at the end.
- OCR accuracy varies across document types and eras. A model that works well on 2010-era typed deeds may struggle with 1970s handwritten instruments. Pilot across representative samples, not just easy ones.
- Multi-page and attachment handling is overlooked. Documents that span multiple pages or include embedded exhibits often break workflows built for single-page instruments.
- No pilot phase. Going straight to full-volume production without piloting on a representative sample is a frequent source of project delays.
- Staff capacity isn't accounted for. If the same team handles both day-forward and backfile work, the hours actually available for backfile are fewer than planned.
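The interaction between exception volume and staff capacity can be shown with simple arithmetic. Every number in this sketch is hypothetical (document count, exception rates, minutes per exception, reviewer hours); the point is how quickly the timeline moves when the exception rate is underestimated.

```python
def review_weeks(total_docs: int, exception_pct: int,
                 minutes_per_exception: float,
                 reviewers: int, backfile_hours_per_week_each: float) -> float:
    """Weeks of manual exception review at the given staffing level."""
    exception_docs = total_docs * exception_pct / 100
    review_hours = exception_docs * minutes_per_exception / 60
    weekly_capacity = reviewers * backfile_hours_per_week_each
    return review_hours / weekly_capacity


# Hypothetical project: 200,000 documents, 3 minutes per exception,
# 2 reviewers who can each spare 10 hours a week from day-forward work.
planned = review_weeks(200_000, 5, 3, reviewers=2, backfile_hours_per_week_each=10)
actual = review_weeks(200_000, 15, 3, reviewers=2, backfile_hours_per_week_each=10)
print(planned)  # 25.0
print(actual)   # 75.0
```

In this example, an exception rate of 15 percent instead of the estimated 5 percent triples the review timeline from 25 to 75 weeks, which is why piloting on representative samples matters before committing to a schedule.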
Disclaimer: This guide is educational in nature. It is not legal advice or a substitute for consulting with your office's legal counsel or state records management agency.
Related Guides
AI Document Indexing for County Records
How AI-assisted indexing works — OCR, extraction, exception review, and realistic expectations.
Reindexing, Quality Control, and Imports
Cleaning up legacy index data, building QC workflows, and importing into downstream systems.
Public Records Indexing in Ohio
State-specific guide for Ohio county recorders — auditor pre-approval, ORC formatting standards, and NHPRC digitization grants.
Public Records Indexing in Illinois
State-specific guide for Illinois county recorders of deeds — race-notice recording, Cook County merger, and digitization grants.