Beyond the Shelf Technology

The Ideology
Image Capture
Quality Control
Optical Character Recognition
Document Structuring
Lessons Learned


The Ideology

Preservation microfilm remains the preferred choice for long-term access to text-based sources. Properly prepared and stored, 35mm silver halide master negative microfilm has a life expectancy of 500 years. Digital images, while not as persistent, permit nimble access and wide distribution. A hybrid approach employing the strengths of both technologies provides a strategy that minimizes the use of the imperiled source documents, ensures a low-maintenance preservation copy and develops the dynamic dimensions of digital solutions.


Image Capture

Beyond the Shelf Image Management Specialists capture 400 dpi bitonal TIFF images from the second-generation print master using two Mekel M525 microfilm scanners. Although the scanner has an automatic image capture feature, the Specialists operate the scanner in manual mode to obtain the most precise image capture. Most of the titles are filmed in a “2-page per frame” orientation. The Mekel scanner “splits” the duplex pages into single leaves.

When illustrations, charts or maps are better rendered in grayscale, they are captured as 8-bit 400 dpi images. Microfilm targets, such as eye-legible title targets and lists of irregularities, are scanned because their information is valuable during the encoding processes, but they are not displayed to the public in the final presentation. Blank pages are targeted as such, and these targets are presented in the public view.

Quality Control

The Image Management Specialists run a series of scripts on batches of images to crop, align, de-speckle, de-skew and rename the files. In a final pass, significant marginalia or obscuring noise is removed using Adobe Photoshop. The digital master files are burned to CDRs. The digital master CDR is copied and immediately stored offsite.

Optical Character Recognition

The Digital Masters are used to perform the optical character recognition for generation of the searchable text. To assure a high level of accuracy in the recognition process, PrimeRecognition OCR software utilizing 3 recognition engines and voting technology is used. The OCR text files are given the same sequential number as their page image.
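The idea behind voting among several recognition engines can be illustrated with a minimal sketch. This is not PrimeRecognition's algorithm, which is not described here; it assumes, for simplicity, that each engine returns the same number of words for a line, so that the readings can be compared position by position. The function name `vote_words` is hypothetical.

```python
from collections import Counter


def vote_words(readings):
    """Return, for each word position, the reading most engines agree on.

    Assumption for this sketch: every engine returns the same number of
    words, so no alignment step is needed. Real voting software must
    align outputs that differ in length.
    """
    token_lists = [reading.split() for reading in readings]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("engines returned different word counts")

    voted = []
    for candidates in zip(*token_lists):
        word, count = Counter(candidates).most_common(1)[0]
        # If no two engines agree, fall back to the first engine's reading.
        voted.append(word if count > 1 else candidates[0])
    return " ".join(voted)


engine_outputs = [
    "silver halide film",
    "si1ver halide film",
    "silver ha1ide film",
]
print(vote_words(engine_outputs))
```

Each engine misreads a different word, so the majority reading at every position recovers the correct text.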

Document Structuring

Beyond the Shelf texts are encoded using an XML structural markup language adhering to the Digital Library Federation specifications outlined in TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. These guidelines outline five increasingly granular levels of encoding. These levels are established so that libraries can match encoding guidelines to specific project goals.

One of the goals of Beyond the Shelf is to build a sustainable, high-production workflow for producing electronic texts using the microfilm-to-digital-page-image approach. Towards that end, automation is used whenever possible. Additionally, the project is predicated on the use of a non-proprietary encoding format that is both extensible and interoperable. The encoding format must facilitate full-text searching as well as a basic document hierarchy for table of contents presentation. As specified in TEI Text Encoding in Libraries: Guidelines, the Minimal Encoding (Level 2) approach is used to create electronic text for keyword searching, which resides alongside the basic book structure used for navigating the digital version. The TEI-Lite (Text Encoding Initiative) encoding scheme, which conforms to the XML standard, is used as the markup language.

An Encoding Technician navigates the scanned page images and creates a spreadsheet for each book recording page number, image file name and sequence, as well as major document structuring specifics. Encoded headers for the texts are automatically created from saved drafts of the MARC records for the digital titles. These headers have tags indicating the start and end of the document instance as well as a completed TEI-Lite header. Individual OCR pages are automatically given end-of-page markers (ASCII character 12) by the optical character recognition process. These place markers are used to insert encoding chunks specifying page breaks in the text. Once the encoding chunks have been added to the individual OCR page files, the individual OCR pages are merged into one file. The merged files comprise the body of the text without the header or beginning and end of document tags. The merged OCR files are then joined with their encoded header counterparts, producing a complete document instance for each book.
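The merge step described above might be sketched as follows. This is an illustration under stated assumptions, not the project's actual scripts: the function names are hypothetical, the page break is written as a TEI `<pb n="..."/>` element placed at the top of each page, and the division elements recorded in the spreadsheet are omitted for brevity.

```python
from xml.sax.saxutils import escape, quoteattr

FORM_FEED = "\x0c"  # ASCII character 12, the OCR end-of-page marker


def encode_page(ocr_text, page_number):
    """Drop the end-of-page marker and prefix the page with a page break."""
    # escape() protects characters such as & and < that would break the XML.
    content = escape(ocr_text.replace(FORM_FEED, "").strip())
    return f"<pb n={quoteattr(page_number)}/>\n{content}"


def merge_pages(pages):
    """Merge (ocr_text, page_number) pairs, in sequence, into one body."""
    return "\n".join(encode_page(text, number) for text, number in pages)


def build_document(header_open, body, document_close):
    """Join the merged body with the start tag, header and end tag."""
    return (
        f"{header_open}\n<text>\n<body>\n{body}\n</body>\n</text>\n"
        f"{document_close}\n"
    )


pages = [("First page & more\x0c", "1"), ("Second page\x0c", "2")]
document = build_document("<TEI.2>\n<teiHeader/>", merge_pages(pages), "</TEI.2>")
print(document)
```

Keeping the header strings separate from the body mirrors the workflow in the text, where headers are generated from MARC records and joined to the merged OCR file only at the end.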

After validation by an XML parser (James Clark’s SP software) to ensure correct encoding structure, each XML document is used to create a derivative file used for current online delivery via indexing on a server. The online document is then viewed through the computer interface, currently Dynaweb/Site Search, and checked for correct page image and text pairing.
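Because each OCR text file carries the same sequential number as its page image, mismatches can also be flagged before the visual check. The sketch below is an assumption about how such a pre-check could look, not part of the documented workflow; the function name, the `.tif` and `.txt` extensions and the file names are hypothetical.

```python
from pathlib import PurePath


def unpaired_pages(filenames, image_ext=".tif", text_ext=".txt"):
    """Return (images without text, text files without images).

    Files are paired by the part of the name before the extension, which
    this sketch assumes is the shared sequential number.
    """
    images = {PurePath(n).stem for n in filenames if n.lower().endswith(image_ext)}
    texts = {PurePath(n).stem for n in filenames if n.lower().endswith(text_ext)}
    return sorted(images - texts), sorted(texts - images)


listing = ["0001.tif", "0001.txt", "0002.tif", "0003.txt"]
missing_text, missing_image = unpaired_pages(listing)
print("images without text:", missing_text)
print("text without images:", missing_image)
```

A check like this only confirms that every page has a partner; whether the text actually matches the image still has to be confirmed by viewing the online document.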

Lessons Learned

In the brief time Beyond the Shelf has been in action, we’ve learned some very significant lessons.
