Beyond the Shelf Technology
The Ideology
Image Capture
Quality Control
Optical Character Recognition
Document Structuring
Lessons Learned
The Ideology
Preservation microfilm remains the preferred choice for long-term access to text-based sources. Properly prepared and stored, 35mm silver halide master negative microfilm has a life expectancy of 500 years. Digital images, while not as persistent, permit nimble access and wide distribution. A hybrid approach employing the strengths of both technologies provides a strategy that minimizes the use of the imperiled source documents, ensures a low-maintenance preservation copy and develops the dynamic dimensions of digital solutions.
Image Capture
Beyond the Shelf Image Management Specialists capture 400 dpi bitonal TIFF images from the second-generation print master using two Mekel M525 microfilm scanners. Although the scanner has an automatic image capture feature, the Specialists operate it in manual mode to obtain the most precise image capture. Most of the titles are filmed in a “2-page per frame” orientation. The Mekel scanner “splits” the duplex pages into single leaves.
When illustrations, charts or maps are better rendered in grayscale, they are captured as 8-bit 400 dpi images. Microfilm targets, such as eye-legible title targets and lists of irregularities, are scanned because their information is valuable during the encoding process, but they are not displayed to the public in the final presentation. Blank pages are targeted as such and are presented in the public view.
Quality Control
The Image Management Specialists run a series of scripts on batches of images to crop, align, de-speckle, de-skew and rename the files. In a final pass, significant marginalia or obscuring noise is removed using Adobe Photoshop. The digital master files are burned to CDRs. The digital master CDR is copied and immediately stored offsite.
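The project's own scripts are not reproduced here. As a minimal sketch of the renaming step only, assuming a zero-padded sequential naming scheme (the width, the `*.tif` pattern and the function names are illustrative assumptions, not the project's actual conventions), a batch rename might look like this:

```python
from pathlib import Path


def sequential_names(filenames, width=8, start=1):
    """Map each file name to a zero-padded sequential name, keeping its extension.

    Files are numbered in sorted order of their current names (an assumption).
    """
    ordered = sorted(filenames)
    return {
        name: f"{index:0{width}d}{Path(name).suffix.lower()}"
        for index, name in enumerate(ordered, start=start)
    }


def rename_batch(directory, pattern="*.tif", width=8):
    """Rename every matching file in `directory` to its sequential name."""
    directory = Path(directory)
    plan = sequential_names([p.name for p in directory.glob(pattern)], width=width)
    # Phase 1: move every file to a temporary name so that no target name
    # can collide with a file that has not been renamed yet.
    for old in plan:
        (directory / old).rename(directory / (old + ".tmp"))
    # Phase 2: move the temporary files to their final sequential names.
    for old, new in plan.items():
        (directory / (old + ".tmp")).rename(directory / new)
    return plan
```

Building the rename plan separately from applying it lets an operator review the mapping before any file on disk is touched.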
Optical Character Recognition
The Digital Masters are used to perform the optical character recognition that generates the searchable text. To assure a high level of accuracy in the recognition process, PrimeRecognition OCR software utilizing three recognition engines and voting technology is used. Each OCR text file is given the same sequential number as its page image.
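How PrimeRecognition implements its voting is not described here. To illustrate the general idea only, the following sketch applies a simple word-level majority vote to three engine outputs, under the strong assumption that the outputs are already aligned token by token. The function name and the fallback rule are hypothetical.

```python
from collections import Counter


def vote_tokens(readings, fallback=0):
    """Pick, for each position, the token that a majority of engines agree on.

    readings: one list of tokens per engine, assumed aligned and equal in length.
    fallback: index of the engine to trust when no majority exists.
    """
    if len({len(reading) for reading in readings}) != 1:
        raise ValueError("engine outputs must be aligned to the same length")
    result = []
    for candidates in zip(*readings):
        token, count = Counter(candidates).most_common(1)[0]
        # Accept the token only if more than half of the engines produced it.
        if count > len(candidates) / 2:
            result.append(token)
        else:
            result.append(candidates[fallback])
    return result


engine_a = "Preservation microfi1m remains".split()
engine_b = "Preservation microfilm remains".split()
engine_c = "Preservat1on microfilm rernains".split()
print(" ".join(vote_tokens([engine_a, engine_b, engine_c])))
```

Each engine in the example misreads a different word, so the vote recovers the correct reading at every position. Real engine outputs rarely align this neatly, and aligning them is the harder part of the problem.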
Document Structuring
Beyond the Shelf texts are encoded using an XML structural markup language adhering to the Digital Library Federation specifications outlined in TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices. These guidelines outline five increasingly granular levels of encoding. These levels are established so that libraries can match encoding guidelines to specific project goals.
One of the goals of Beyond the Shelf is to build a sustainable, high-production workflow for producing electronic texts using the microfilm to digital page image approach. Toward that end, automation is used whenever possible. Additionally, the project is predicated on the use of a non-proprietary encoding format that is both extensible and interoperable. The encoding format must facilitate full-text searching as well as a basic document hierarchy for table of contents presentation. As specified in TEI Text Encoding in Libraries: Guidelines, the Minimal Encoding (Level 2) approach is used to create electronic text for keyword searching, alongside the basic book structure needed for navigating the digital version. The TEI-Lite (Text Encoding Initiative) encoding scheme, which conforms to the XML standard, is used as the markup language.
An Encoding Technician navigates the scanned page images and creates a spreadsheet for each book recording page number, image file-name and sequence, as well as major document structuring specifics. Encoded headers for the texts are automatically created from saved drafts of the MARC records for the digital titles. These headers have tags indicating the start and end of the document instance as well as a completed TEI-Lite header. Individual OCR pages are automatically given end of page markers (ASCII character 12) by the optical character recognition process. These place markers are used to insert encoding chunks specifying page breaks in the text. Once the encoding chunks have been added to the individual OCR page files, the individual OCR pages are merged into one file. The merged files comprise the body of the text without the header or beginning and end of document tags. The merged OCR files are then joined with their encoded header counterparts, producing a complete document instance for each book.
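The merge step described above can be sketched as follows. This is a minimal illustration and not the project's actual script. It assumes that each OCR page file ends with a form feed, that the page-break chunk is a `pb` element with an `n` attribute, and that the chunk is placed at the start of its page. The function names and the header and closing-tag strings passed to `assemble_document` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

FORM_FEED = "\x0c"  # ASCII character 12, the end-of-page marker named in the text


def encode_page(ocr_text, page_label):
    """Return one OCR page as escaped text preceded by a page-break element."""
    # Drop the end-of-page marker and any surrounding whitespace.
    text = ocr_text.replace(FORM_FEED, "").strip()
    # escape() protects &, < and > so stray OCR characters cannot break the XML.
    return f"<pb n={quoteattr(str(page_label))}/>\n{escape(text)}\n"


def merge_pages(pages):
    """Merge (page_label, ocr_text) pairs, in reading order, into one body string."""
    return "".join(encode_page(text, label) for label, text in pages)


def assemble_document(header_and_open_tags, body, closing_tags):
    """Join the encoded header, the merged body and the closing tags."""
    return header_and_open_tags + body + closing_tags
```

For example, `merge_pages([("1", "First page\x0c"), ("2", "Second page\x0c")])` yields two page-break elements, each followed by its page text. The page labels would come from the Encoding Technician's spreadsheet.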
After validation by an XML parser (James Clark’s SP software) to ensure correct encoding structure, each XML document is used to create a derivative file that is indexed on a server for online delivery. The online document is then viewed through the computer interface, currently Dynaweb/Site Search, and checked for correct page image and text pairing.
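The project validates with SP and checks the pairing by viewing the online document. Part of that check could be automated before delivery. The sketch below uses Python's standard library in place of SP, so it tests well-formedness only and does not validate against the TEI-Lite DTD. It assumes page breaks are encoded as `pb` elements with an `n` attribute and that the expected labels come from the technician's spreadsheet. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def page_breaks(xml_text):
    """Return the n attribute of every pb element, in document order.

    ET.fromstring raises ET.ParseError if the document is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return [pb.get("n") for pb in root.iter("pb")]


def pairing_problems(xml_text, image_labels):
    """List mismatches between encoded page breaks and the expected page labels."""
    found = page_breaks(xml_text)
    problems = []
    if len(found) != len(image_labels):
        problems.append(
            f"{len(found)} page breaks but {len(image_labels)} page images"
        )
    # Compare label by label for as many positions as both lists share.
    for position, (label, expected) in enumerate(zip(found, image_labels), start=1):
        if label != expected:
            problems.append(
                f"position {position}: found {label!r}, expected {expected!r}"
            )
    return problems
```

An empty result means the encoded page sequence matches the spreadsheet. It does not replace the visual check that each image really shows the page its text claims.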
Lessons Learned
In the brief time Beyond the Shelf has been in action, we’ve learned some very significant lessons.
- The Mekel M525 microfilm scanner’s Windows NT operating system was not reliable. An upgrade to Windows 2000 in July 2003 has improved throughput and reduced downtime.
- Quality takes time! Creating a digital reproduction to the level of quality of the original (i.e. legible images with legible text) significantly increases production time and operation costs. As such, we quickly realized we had but two choices:
- Sacrifice the quality of the final file image in order to create a large volume of titles quickly during the two-year project; or
- Sacrifice a large volume of titles for fewer, high-quality images.
- Cleaned and separated page images are more desirable to patrons than unaltered, double-page images, and they more closely resemble our envisioned model of digital preservation of the original source document. Therefore, for the foreseeable future, Beyond the Shelf will focus on quality over quantity. However, the objective remains to achieve a balance.
- Many more source documents than anticipated contain images, either photographs or line drawings. These render much better in grayscale than in bitonal and thus require more time for image capture and editing.
- Quality costs! Careful tracking of the time spent on image capture and quality control shows that the cost for those steps averages .62 per page. The expectation is that costs will diminish as the Image Management Specialists gain expertise. Windows 2000 will increase throughput and reduce costs. Project Managers are deconstructing the workflow to identify tasks that can be efficiently performed by students. The use of student helpers will reduce the costs. But before these tasks can be delegated, they must be thoroughly understood by the Image Management Specialists and the Project Managers. Therefore,
- Learning Curves cost!
- Multi-tasking saves time and money: technicians who have access to more than one computer can assign a variety of tasks to each unit and decrease production time.
- The Beyond the Shelf staff have always embraced the logic that consistency, the use of standards and the use of interoperable, non-proprietary technology make migrations and transitions more seamless. They are now experiencing that first hand as they migrate from Dynaweb to the University of Michigan’s purpose-built DLXS platform.
- Purpose-built software created by other institutions is better suited to our needs than vendor software.
- Begin time tracking and cost analysis after you establish a good workflow.
- Promote the project tirelessly! Start with your colleagues within the institution. You will want to make sure they understand your project and your goals so that they can be effective in their networking.
- When working with microfilm, you become aware of inconsistent methods of targeting for second exposures. Sometimes targets are present, and sometimes they are not. And worse, sometimes second exposures are not made when they probably should have been. This inconsistency can reduce production throughput, because the scanner operator must make more time-consuming evaluations as they proceed through the film.
