LISTSERV - VA-HIST Archives - LISTLVA.LIB.VA.US

Yes, I failed at a similar task, but for different reasons. All my documents
were in outline form with several successive levels of indentation per page.
All indentations were lost or dislocated, but dropping the scanned, OCR'd
document into Word for spellchecking often gave a good guess for "misspelled"
words. Had the document's format been preserved, I think scanning, then OCR,
then spellchecking would have been successful enough. Regardless, it is a lot
of work.
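
For anyone who wants to try the spellcheck step outside of Word, here is a minimal sketch of the same idea in Python: each token the OCR produced is compared against a word list, and the closest known word is offered as a guess. The vocabulary, the function name `suggest`, and the 0.7 similarity cutoff are all assumptions made for illustration; this is not how Word's spellchecker works internally.

```python
import difflib

# Hypothetical vocabulary; a real run would load a full word list instead.
VOCABULARY = ["minutes", "foundation", "board", "supervisors", "meeting", "the", "of"]

def suggest(word, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest known word for an OCR token, or None if nothing is close."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered  # already a known word, nothing to correct
    # get_close_matches ranks candidates by difflib's similarity ratio
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Typical OCR confusions: "m" read as "rn", or a dropped letter.
for token in ["rninutes", "foundaton", "xyzzy"]:
    print(token, "->", suggest(token))
```

A guess like this still needs a human to confirm it, which is why the proofing effort does not go away.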

I have bought CDs of typed genealogical information where each page is an
image of the typed paper and not the OCR result of scanning. The provider
created an index that locates the image page where the information resides.
That may be less work than a complete retyping, but the result is only
searchable to the extent that the indexing is good. I could have done this,
as well as made an image of the author's index, but 600 pages was a bit much.

By the way, getting images via a very good digital camera is, of course, much
faster than scanning. My camera does an excellent job of making images of
pages, but it was a costly machine. I do not know of a really good solution
for that chore.
----- Original Message -----
From: "Randy Cabell" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Wednesday, March 20, 2002 1:54 PM
Subject: Capturing old Text via OCR


Is there any rule of thumb, or are there any guidelines, for OCR vs. retyping
of old documents? I am looking into converting minute books of The Cabell
Foundation from 1955 - 2002 to editable (searchable) text. OCR came to mind
first, since I have been very successful doing contemporary minutes of
Boards of Supervisors and School Boards.

But the early Cabell minutes were typed with a typewriter which formed very
poor characters, many not closed, downstrokes faint or missing on characters
like "m" and "p", etc. Using OmniPage to OCR a page was a complete
disaster. I had to intervene in about 40 cases, but it missed 70-100 or so
words on the page completely because it did not recognize characters.  And
of course the higher the intervention and error rates, the more time is
required to proof the final copy to make sure it did not miss anything.

At the moment, it looks to me as if a page that requires more than a dozen or
so 'interventions' during the OCR process is better handled by simply
retyping everything from the start.
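
The trade-off behind that rule of thumb can be sketched as a rough per-page time estimate. Everything numeric below is an assumption chosen only for illustration (seconds per intervention, seconds to fix a missed word, proofing time per word, typing speed), and `faster_method` is a hypothetical helper; substitute figures measured on your own pages.

```python
def faster_method(interventions, missed_words, words_on_page,
                  secs_per_intervention=20.0,   # assumed cost of each OCR prompt
                  secs_per_missed_word=10.0,    # assumed cost to fix a missed word
                  proof_secs_per_word=0.5,      # assumed proofing time per word
                  typing_wpm=50.0):             # assumed typing speed
    """Return "ocr" or "retype", whichever has the lower estimated time per page."""
    ocr_secs = (interventions * secs_per_intervention
                + missed_words * secs_per_missed_word
                + words_on_page * proof_secs_per_word)
    retype_secs = words_on_page / typing_wpm * 60.0
    return "ocr" if ocr_secs < retype_secs else "retype"

# A clean contemporary page versus a page like the early typewritten minutes.
print(faster_method(interventions=2, missed_words=0, words_on_page=300))
print(faster_method(interventions=40, missed_words=85, words_on_page=300))
```

The estimate ignores the proofing that retyped text also needs, so it leans slightly toward retyping.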

Any experience out there to share?

Randy Cabell

To subscribe, change options, or unsubscribe, please see the instructions
at http://listlva.lib.va.us/archives/va-hist.html
