A common web application is to provide material that would otherwise be available only in printed form. Ideally the material would be converted to text and HTML, but this is a labor-intensive process and isn't always practical. Manually typing the text, or scanning the pages and using optical character recognition (OCR) software to recreate it, is time consuming, and more time is then required for careful proofreading and corrections. Any illustrations must be handled as individual images if they are to be retained.
The alternative is to scan the printed pages and make these more-or-less raw images available. The disadvantages of this approach are the much larger file sizes (and therefore increased storage requirements and download times) and the loss of normal text search capabilities. Also, an image of a page of text will probably not be usable with assistive technologies like screen readers. Nevertheless, scanned images of printed pages often must be used. (A benefit of using images is to preserve the original appearance of the page – typography, layout, etc. – which may be important in some kinds of study. In some situations the ideal might be to provide both an image and a text version of a page.) To be most useful the scanned images require at least a little processing before they are made available on the web.
This isn't a tutorial on scanning or image editing – you will need to be familiar with the scanner you will be using and the software that supports it. Some scanning software can perform the manipulations described here, but you may need an image editing program like Photoshop. Some software can be set up to perform the same operations on a batch of images automatically, greatly reducing the amount of time someone will be occupied preparing the images. We have some general information on using images and graphics.
The first step is to scan the printed material. To capture normal type and simple black and white illustrations, scanning at 300 d.p.i. and 256 gray levels is usually sufficient. Scan in color only when it is necessary – color files will be much larger than gray scale or black and white. Most scanning software will produce TIFF files. If possible, use simple LZW compression to reduce the size of the files. Some scanning programs produce JPEG files. If this is the case, be sure to use the highest quality setting to avoid losing detail.
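To make "lossless" concrete, here is a minimal sketch of textbook LZW in Python. It is not the exact variant used inside TIFF files (which packs variable-width codes and uses special clear codes), and the function names are hypothetical. It only illustrates that a run-heavy row of pixels shrinks and that the original bytes come back unchanged.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Textbook LZW: turn bytes into a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate  # keep extending the known sequence
        else:
            codes.append(table[current])
            table[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the original bytes from the codes made by lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Code refers to the sequence that is just being defined.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


# A made-up row of pixels: mostly white (255) with a short black run (0).
row = bytes([255] * 200 + [0] * 8 + [255] * 200)
codes = lzw_compress(row)
print(len(row), "bytes in,", len(codes), "codes out")
print("round trip identical:", lzw_decompress(codes) == row)
```

Scanned text pages contain long runs of identical white pixels, which is why this kind of compression helps without discarding any detail.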
If the ability to print the scanned images will be needed, then 300 d.p.i. should be used for the images. If only screen display is required, then 72 d.p.i. will be adequate and will reduce the size of files to about 6% of the 300 d.p.i. files, because the number of pixels falls with the square of the resolution. It is still usually best to scan the pages at 300 d.p.i. and then reduce them to 72 d.p.i., to capture fine detail and get the best legibility. Changing a 300 d.p.i. image to 72 d.p.i. and then scaling it to 24% (the ratio 72/300) will result in an image about the same size as the original. It may be necessary to scale to only 50% to maintain legibility, but this is still a significant space saving.
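The percentages in this paragraph follow from simple arithmetic, sketched below in Python. The function name is hypothetical, and the calculation ignores compression, so it describes pixel counts rather than exact file sizes.

```python
def downsample_factors(source_dpi: float, target_dpi: float):
    """Return (linear scale, fraction of pixels kept) when reducing resolution."""
    linear = target_dpi / source_dpi   # applies to width and to height
    area = linear ** 2                 # pixel count shrinks in both directions
    return linear, area


linear, area = downsample_factors(300, 72)
print(f"linear scale: {linear:.0%}")   # 24%
print(f"pixels kept:  {area:.1%}")     # 5.8%, i.e. "about 6%"

# Scaling to 50% instead keeps a quarter of the pixels.
half_linear, half_area = downsample_factors(300, 150)
print(f"at 50% scale: {half_area:.0%} of the pixels")  # 25%
```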
In the following examples a 6.5 x 10 inch page is used. This is a little smaller than some journal pages, but will give a good idea of the kind of processing needed and sizes of the files involved. The page would occupy about 4 kilobytes if it were converted to text and HTML form.
First, the page was scanned in 24-bit color mode at 300 d.p.i. Without compression this produces a file of about 17 megabytes. Saving the file with LZW compression yields a file of 12 megabytes with no loss of data – a saving of about 30%. This is the simplest file to produce, but is far from ideal. (Downloading a 12 megabyte file on a 56 kbps dial-up connection would take about 35 minutes. A ten-page journal article would use 120 megabytes.)
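The figures above can be checked with a short calculation. This is a rough sketch: the function names are hypothetical, a megabyte is taken as one million bytes, and the `efficiency` parameter (the share of the nominal modem rate actually achieved) is an assumption, not something stated in the text.

```python
def raw_size_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed image size for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def download_minutes(size_bytes: int, link_bits_per_second: int, efficiency: float = 1.0) -> float:
    """Transfer time in minutes; efficiency < 1 models protocol and line overhead."""
    return size_bytes * 8 / (link_bits_per_second * efficiency) / 60


color_bytes = raw_size_bytes(6.5, 10, 300, 24)
print(color_bytes)  # 17550000, i.e. about 17 megabytes

nominal = download_minutes(12_000_000, 56_000)
realistic = download_minutes(12_000_000, 56_000, efficiency=0.8)  # assumed overhead
print(f"{nominal:.1f} min at the nominal rate, {realistic:.1f} min at 80% efficiency")
```

At the full nominal rate the download takes a little under half an hour. The 35-minute figure is consistent with an effective throughput somewhat below the nominal 56 kbps.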
Since this is a black and white page it can easily be converted to 256 gray levels, yielding a file of 4 megabytes. Adjusting the gamma and contrast removes some of the noise in the scanned image, makes it a little easier to read, and further reduces the file size to less than 2 megabytes. It is often possible to further reduce the file to black and white rather than gray scale (1 bit/pixel rather than 8 bits/pixel). This yields nearly another order of magnitude saving – a file size of less than 200 kilobytes. This kind of image will usually show ragged edges on type, but maintain good legibility both on screen and in print. It can actually print more cleanly than a gray scale image.
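The step from 8 bits per pixel to 1 bit per pixel can be sketched as a threshold followed by bit packing. This is an illustration only: the cutoff of 128, the choice of 1 for black, and the function names are assumptions, and real image editors offer more refined conversions.

```python
def threshold_row(gray_row: bytes, cutoff: int = 128) -> list:
    """Map each 8-bit gray value to 1 (black) if below the cutoff, else 0 (white)."""
    return [1 if value < cutoff else 0 for value in gray_row]


def pack_bits(bits: list) -> bytes:
    """Pack 0/1 values into bytes, 8 pixels per byte, most significant bit first."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


# Ten made-up gray pixels: light paper with a few dark "ink" pixels.
gray_row = bytes([250, 245, 30, 20, 25, 240, 250, 252, 10, 248])
bits = threshold_row(gray_row)
packed = pack_bits(bits)
print(bits)                         # [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(len(gray_row), len(packed))   # 10 2
```

Packing alone gives roughly an eightfold reduction before any compression is applied. The hard cutoff is also what produces the ragged edges on type mentioned above, since intermediate gray values along letter edges are forced to pure black or pure white.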
The scanned pages can be presented in a number of ways. Some web browsers don't handle TIFF images well (or at all without a helper application), so converting the images to GIF or JPEG format may be a good idea. GIF is a lossless format that usually provides adequate compression for text pages. JPEG is a lossy format that provides several levels of compression. The higher the compression, the more visual artifacts and fuzziness will appear. In some cases GIF will make a smaller file than JPEG; in others the JPEG will be smaller, so experiment. Newer browsers also handle PNG format, so that may be another option to try. Even when a browser supports an image format it may not do a very good job of presenting a large image to the user. Printing often presents additional problems.
Since GIF images are dimensionless – they are just raster images with no resolution information – and not all applications handle JPEG resolution settings well, printing them can sometimes be problematic. If printing might be required, it is possible to place images in a PDF file using Adobe Acrobat. (This can't be done with the free Adobe Acrobat Reader; the full Acrobat product is required, but the resulting PDF file can be displayed with the Reader.) Multiple pages can be combined into a single PDF file. This increases the size of the file, but a PDF can usually use progressive display and begin presenting pages using the data as it arrives. PDF compression can sometimes save a little additional space. Adobe Acrobat also does a better job of presenting large images than most browsers, supporting zooming and scrolling, so PDF is often a good choice even when printing isn't a major concern. It is also possible to include both text and image within a PDF.
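The printing difficulty with images that carry no resolution information comes down to simple arithmetic: the printed size depends entirely on the resolution the printing application assumes. The sketch below uses a hypothetical function name and an illustrative 468 x 720 pixel image (a 6.5 x 10 inch page at 72 d.p.i.).

```python
def printed_size_inches(width_px: int, height_px: int, assumed_dpi: float):
    """Physical print size if the application assumes the given resolution."""
    return width_px / assumed_dpi, height_px / assumed_dpi


# The same pixels, interpreted at two different resolutions.
at_72 = printed_size_inches(468, 720, 72)
at_300 = printed_size_inches(468, 720, 300)
print(f"assumed 72 d.p.i.:  {at_72[0]:.2f} x {at_72[1]:.2f} in")    # 6.50 x 10.00 in
print(f"assumed 300 d.p.i.: {at_300[0]:.2f} x {at_300[1]:.2f} in")  # 1.56 x 2.40 in
```

A format that records the intended page size, as PDF does, removes this guesswork.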
A SCANNED TEXT EXAMPLE