USING SCANNED TEXT
A common web application is to provide material that would otherwise be
available only in printed form. Ideally the material would be converted
to text and HTML, but as this is a labor-intensive process it isn't
always practical. Manually typing the text, or scanning the pages and
using optical character recognition (OCR) software to recreate it, is
time consuming, and more time is then required for careful proofreading
and corrections. Any illustrations must be handled as individual images
if they are to be retained.
The alternative is to scan the printed pages and make these more-or-less
raw images available. The disadvantages of this approach are the much
larger file sizes (and therefore increased storage requirements and
download times) and the loss of normal text search capabilities. Also,
an image of a page of text will probably not be usable with assistive
technologies like screen readers. Nevertheless, scanned images of
printed pages often must be used. (A benefit of using images is that
they preserve the original appearance of the page, including typography
and layout, which may be important in some kinds of study. In some
situations the ideal might be to provide both an image and a text
version of a page.)
To be most useful the scanned images require at least a little processing
before they are made available on the web.
This isn't a tutorial on scanning or image editing — you will need to
be familiar with the scanner you will be using and the software that
supports it. Some scanning software can perform the manipulations
described here, but you may need an image editing program like
Photoshop. Some software can be set up to perform the same operations on
a batch of images automatically, greatly reducing the amount of time
someone will be occupied preparing the images. We have some general
information on using images and graphics. (Also check the IT Training
web pages for relevant courses.)
The first step is to scan the printed material. To capture normal type
and simple black and white illustrations, scanning at 300 d.p.i. and 256
gray levels is usually sufficient. Scan in color only when it is
necessary — color files will be much larger than gray scale or black and
white. Most scanning software will produce TIFF files. If possible, use
simple LZW compression to reduce the size of the files. Some scanning
programs produce JPEG files. If this is the case, be sure to use the
highest quality setting to avoid losing detail.
If the ability to print the scanned images will be needed, then 300
d.p.i. should be used for the images. If only screen display is
required, then 72 d.p.i. will be adequate and will reduce the size of
files to about 6% of the 300 d.p.i. files. It is still usually best to
scan the pages at 300 d.p.i. and then reduce them to 72 d.p.i., since
this captures fine detail and gives the best legibility. Changing a 300
d.p.i. image to 72 d.p.i. and then scaling it to 24% (72/300) will
result in an image that displays at about the same size as the original
page. It may be necessary to scale to only 50% to maintain legibility,
but this still cuts the pixel count to a quarter, a significant space
saving.
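The percentages above follow from simple ratios, which a minimal sketch can make explicit. The function names are illustrative only and do not come from any scanning tool, and the sketch assumes the image is resampled by the same factor horizontally and vertically.

```python
def linear_scale(target_dpi, source_dpi):
    """Scale factor applied to width and height when changing resolution."""
    return target_dpi / source_dpi


def pixel_fraction(scale):
    """Fraction of the original pixels that remain after scaling both dimensions."""
    return scale * scale


scale_72 = linear_scale(72, 300)

print(f"scale for 72 d.p.i.: {scale_72:.0%}")                       # 24%
print(f"pixels kept at 72 d.p.i.: {pixel_fraction(scale_72):.1%}")  # 5.8%
print(f"pixels kept at 50% scale: {pixel_fraction(0.5):.0%}")       # 25%
```

Uncompressed file size is proportional to pixel count, so these fractions are only a first estimate of the space saved once compression is applied.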
In the following examples a 6.5 x 10 inch page is used. This is a little
smaller than some journal pages, but will give a good idea of the kind
of processing needed and sizes of the files involved. The page would
occupy about 4 kilobytes if it were converted to text and HTML form.
First the page has been scanned in 24 bit color mode at 300 d.p.i.
Without compression this produces a file of about 17 megabytes. Saving
the file with LZW compression yields a file of 12 megabytes with no loss
of data, a saving of about 30%. This is the simplest file to produce,
but is far from ideal. (Downloading a 12 megabyte file on a 56 kbps
dial-up connection would take about 35 minutes. A ten-page journal
article would use 120 megabytes.)
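The figures in this example can be checked with a little arithmetic. The sketch below uses hypothetical helper names and assumes decimal megabytes and an uncompressed raster with no file header. The 46 kbps effective rate is an assumed figure for a typical real-world dial-up connection, not a value from this page.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw raster size in bytes for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def download_minutes(size_bytes, bits_per_second):
    """Transfer time in minutes, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second / 60


color = uncompressed_bytes(6.5, 10, 300, 24)
gray = uncompressed_bytes(6.5, 10, 300, 8)
bilevel = uncompressed_bytes(6.5, 10, 300, 1)

print(f"24 bit color: {color:,} bytes")    # 17,550,000 bytes
print(f"256 gray levels: {gray:,} bytes")  # 5,850,000 bytes
print(f"black and white: {bilevel:,} bytes")  # 731,250 bytes

# 12 megabyte compressed file at the nominal line rate and at an assumed effective rate
print(f"{download_minutes(12_000_000, 56_000):.0f} minutes at 56 kbps")  # 29 minutes
print(f"{download_minutes(12_000_000, 46_000):.0f} minutes at 46 kbps")  # 35 minutes
```

The nominal 56 kbps rate gives a little under half an hour. The 35 minute estimate corresponds to the somewhat lower throughput that dial-up connections usually achieved in practice.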
Since this is a black and white page it can easily be converted to 256
gray levels, yielding a file of 4 megabytes. Adjusting the gamma and
contrast removes some of the noise in the scanned image, makes it a
little easier to read, and further reduces the file size to less than 2
megabytes. It is often possible to further reduce the file to black and
white rather than gray scale (1 bit/pixel rather than 8 bits/pixel).
This yields nearly another order of magnitude saving: a file size of
less than 200 kilobytes. This kind of image will usually show ragged
edges on type, but maintains good legibility both on screen and in
print. It can actually print more cleanly than a gray scale image.
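To see why cleaning up a scan shrinks the file, here is a rough sketch on synthetic data rather than a real page. It makes several assumptions: the "page" is a small pattern of dark bars with random noise added, a simple levels-style contrast stretch stands in for the gamma and contrast adjustment (gamma itself is not modeled), and zlib's DEFLATE stands in for the lossless compression an image format would apply. All function names and threshold values are illustrative.

```python
import random
import zlib

WIDTH, HEIGHT = 200, 100  # a tiny synthetic "page", far smaller than a real scan


def synthetic_scan(seed=0):
    """Gray values (0-255): dark bars of "ink" on light paper, plus scanner noise."""
    rng = random.Random(seed)
    pixels = []
    for y in range(HEIGHT):
        for x in range(WIDTH):
            is_ink = (y % 10 < 4) and (x % 7 < 3)
            base = 40 if is_ink else 215
            pixels.append(base + rng.randint(-20, 20))
    return pixels


def stretch_contrast(pixels, black=80, white=180):
    """Map values <= black to 0 and >= white to 255, scaling linearly in between."""
    out = []
    for v in pixels:
        if v <= black:
            out.append(0)
        elif v >= white:
            out.append(255)
        else:
            out.append(round((v - black) * 255 / (white - black)))
    return out


def to_bilevel(pixels, cutoff=128):
    """Pack pixels at 1 bit per pixel (1 = white); pixel count must be a multiple of 8."""
    packed = bytearray()
    for i in range(0, len(pixels), 8):
        byte = 0
        for v in pixels[i:i + 8]:
            byte = (byte << 1) | (1 if v >= cutoff else 0)
        packed.append(byte)
    return bytes(packed)


def compressed_size(data):
    """Size in bytes after lossless DEFLATE compression."""
    return len(zlib.compress(data, 9))


raw = synthetic_scan()
cleaned = stretch_contrast(raw)
sizes = {
    "gray, noisy": compressed_size(bytes(raw)),
    "gray, contrast adjusted": compressed_size(bytes(cleaned)),
    "black and white": compressed_size(to_bilevel(cleaned)),
}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

Random noise is nearly incompressible, so the noisy version stays large, while the cleaned versions collapse into long repeated runs. Real pages are far less regular than this pattern, so the savings will be smaller than the sketch suggests.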
The scanned pages can be presented a number of ways. Some web browsers
don't handle TIFF images well (or at all without a helper application),
so converting the images to GIF or JPEG format may be a good idea. GIF
is a lossless format that usually provides adequate compression for text
pages. JPEG is a lossy format that provides several levels of
compression. The higher the compression, the more visual artifacts and
fuzziness will appear. In some cases GIF will make a smaller file than
JPEG; in others the JPEG will be smaller. Experiment. Newer browsers
also handle PNG format, so that may be another option to try. Even when
a browser
supports an image format it may not do a very good job of presenting a
large image to the user. Printing often presents additional problems.
Since GIF images are dimensionless (they are just raster images with no
resolution information) and not all applications handle JPEG resolution
settings well, printing them can sometimes be problematic. If printing
might be required, it is possible to place images in a PDF file using
Adobe Acrobat. (This can't be done with the free Adobe Acrobat Reader;
the full Acrobat product is required, but the resulting PDF file can be
displayed with the Reader.)
Multiple pages can be combined into a single PDF file, which increases
the size of the file, but a PDF can usually use progressive display and
begin presenting pages using the data as it arrives. PDF compression can
sometimes save a little additional space. Adobe Acrobat also does a
better job of presenting large images than most browsers, supporting
zooming and scrolling, so often PDF is a good choice even when printing
isn't a major concern.
This page was last updated on 2003-07-18.
Please direct questions and comments regarding this page to
webmaster@www.uky.edu.
A SCANNED TEXT EXAMPLE
Text detail as originally scanned: 24 bit color, 300 d.p.i., and no
processing. The entire file, compressed, is about 12 megabytes.
Text detail after conversion to gray scale and with gamma and contrast
adjustment. The entire file is less than 2 megabytes.
Text detail after conversion of the preceding example to black and
white. The entire file is less than 200 kilobytes.
From The Register of the Kentucky Historical Society, volume 1, number
1, 1903. This two-page excerpt is available in PDF form in two versions:
gray scale (about 800 kilobytes) and black & white (about 350
kilobytes).