LISTSERV - VA-HIST Archives - LISTLVA.LIB.VA.US

VA-HIST Archives

Discussion of research and writing about Virginia history

VA-HIST@LISTLVA.LIB.VA.US


Subject:	Re: Capturing old Text via OCR
From:	Jim Huffaker <[log in to unmask]>
Reply To:	Discussion of research and writing about Virginia history <[log in to unmask]>
Date:	Wed, 20 Mar 2002 22:15:47 -0500

I should have expanded my comments to say that I have several dozen land
record images, of typed or old script documents, saved as GIF files. GIF
files are not large. I "index" them via the Save As function by naming each
GIF file the same as the document ID and saving it into common folders, e.g.
Washington County\VA. A scanned document can be saved as JPEG or GIF easily
enough, but if you have the proper digital camera, it can be set to save as
GIF, saving a step. Considering old documents are inevitably script, it
seems to make sense to set up and record them all (even the typed ones) as
GIF files. In other words, I do not see any advantage to a scan then OCR
when an indexing or naming scheme can locate a GIF image file that is
readable in its "original" form.

There is one big disadvantage to recording GIF files via digital camera.
Illumination of the document must be absolutely uniform, and that requires
care in selection of light source and geometry of the setup. Save as GIF is
a feature of most scanners, but scanning is relatively slow compared with a
camera (although uniform lighting is assured). Experimentation with the
camera/lighting setup is necessary but is rewarded by a faster process of
copying old documents.

GIF files can be called up by even Microsoft's Paint program and manipulated
as any other photo file, and, of course, burned onto a CD with a CD-RW drive
and removed from the hard drive, an easy way to copy files to share research.
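
As a rough illustration of the naming scheme described above, here is a
minimal Python sketch, assuming each image is a .gif file whose name (minus
the extension) is the document ID and that all images sit somewhere under
one top-level folder. The function names build_index and find_document, and
the sample ID in the usage note, are hypothetical and only for illustration.

```python
from pathlib import Path


def build_index(root):
    """Map each document ID (file name without extension) to its GIF paths.

    Assumption: every image under `root` is a .gif named after its document ID.
    """
    index = {}
    for path in sorted(Path(root).rglob("*")):
        # Keep only GIF image files; the extension match is case-insensitive.
        if path.is_file() and path.suffix.lower() == ".gif":
            index.setdefault(path.stem, []).append(path)
    return index


def find_document(index, doc_id):
    """Return the list of image paths recorded for a document ID (may be empty)."""
    return index.get(doc_id, [])
```

For example, build_index("Washington County") followed by
find_document(index, "DB12-345") would return the paths of any files named
DB12-345.gif under that folder. The lookup is only as good as the naming:
a file saved under the wrong ID will not be found.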
----- Original Message -----
From: "Gene C. Harris" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Wednesday, March 20, 2002 3:50 PM
Subject: Re: Capturing old Text via OCR


> JUST AN INPUT FROM ME.  I'VE HAD GOOD RESULTS WITH OMNIPAGE PROFESSIONAL
> VERSION 9.  I SCAN THE DOCUMENTS AND THEN HAVE THE OCR SOFTWARE DO THE
> REST.  I TOO SAVE THE ORIGINALS AS AN IMAGE TO ACCOMPANY THE OCR'D DOCUMENT.
>
> GENE HARRIS
> RICHMOND.
>
> -----Original Message-----
> From: Discussion of research and writing about Virginia history
> [mailto:[log in to unmask]]On Behalf Of Charles L. Dibble (BLS
> 1338.733)
> Sent: Wednesday, March 20, 2002 3:07 PM
> To: [log in to unmask]
> Subject: Re: Capturing old Text via OCR
>
>
> I welcome this discussion ... because I am looking for a reliable system
> for OCR.  And I hope someone has a better suggestion than mine.  I have
> OmniPage 10 and an HP ScanJet with an automatic document feed.
>
> At the moment, I think the most practical course would probably be to
> follow Jim Huffaker's suggestion.  Scan the documents and preserve the
> digital images, then do a "key word" index in a simple database that
> allows you to search by such things as date, subject matter, names, etc.
> As OCR improves, the digital images could be "OCRed" at a later date.
>
>
> ============================================================================
> Charles L. Dibble
> Post Office Drawer 1240
> Columbia, South Carolina 29202-1240
> email: [log in to unmask]
>
> ============================================================================
>
> -----Original Message-----
> From: Discussion of research and writing about Virginia history
> [mailto:[log in to unmask]]On Behalf Of Jim Huffaker
> Sent: Wednesday, March 20, 2002 14:42
> To: [log in to unmask]
> Subject: Re: Capturing old Text via OCR
>
>
> Yes, I failed at a similar task but for different reasons. All my
> documents were in outline form with several successive indentations per
> page. All indentations were lost or dislocated, but dropping the scanned
> OCR'd document into Word for spellchecking often gave a good guess for
> "misspelled" words. Had the document's format been preserved, I think
> scanning, then OCR to spellchecker would have been successful enough.
> Regardless, it is a lot of work.
>
> I have bought CDs of typed information (genealogical information), where
> it is an image of the typed paper and not the OCR result of scanning. The
> provider created an index which would locate the image page where the
> information resides, and that may be less work than a complete re-typing,
> but it is only searchable to the extent of good indexing. I could have
> done this, as well as made an image of the author's index, but 600 pages
> was a bit much.
>
> By the way, getting images via a very good digital camera is, of course,
> much faster than scanning. My camera does an excellent job of making
> images of pages, but it was a costly machine. I do not know of a really
> good solution for that chore.
> ----- Original Message -----
> From: "Randy Cabell" <[log in to unmask]>
> To: <[log in to unmask]>
> Sent: Wednesday, March 20, 2002 1:54 PM
> Subject: Capturing old Text via OCR
>
>
> Is there any rule of thumb, or are there any guidelines for OCR vs.
> retyping of old documents?  I am looking into converting minutes books of
> The Cabell Foundation from 1955 - 2002 to editable (searchable) text.  OCR
> came to mind first, since I have been very successful doing contemporary
> minutes of Boards of Supervisors and School Boards.
>
> But the early Cabell minutes were typed with a typewriter which formed
> very poor characters, many not closed, downstrokes faint or missing on
> characters like "m" and "p", etc.  Using OmniPage to OCR a page was a
> complete disaster. I had to intervene in about 40 cases, but it missed
> 70-100 or so words on the page completely because it did not recognize
> characters.  And of course the higher the intervention and error rates,
> the more time is required to proof the final copy to make sure it did not
> miss anything.
>
> At the moment, it looks to me that if a page has more than a dozen or so
> 'interventions' required during the OCR process, then one is better off
> just re-typing everything initially.
>
> Any experience out there to share?
>
> Randy Cabell
>
> To subscribe, change options, or unsubscribe, please see the instructions
> at http://listlva.lib.va.us/archives/va-hist.html