KY-NDNP, in cooperation with NEH and the Library of Congress, has developed its 2-day lecture series meta|morphosis into a collection of self-paced learning tutorials and quizzes now freely available to anyone around the world with internet access. With meta|morphosis you'll learn about historic newspapers, microfilm, digital technologies, storage, metadata, and more. To read specifically about the KY-NDNP experience and methodologies, simply scroll down.
For NDNP, sometimes it's not about what you want to digitize, but what you have on microfilm that determines what you can digitize. KY-NDNP is in a unique situation since we have some 30,000 reels of master negative microfilm at our fingertips, literally. Housed in our on-site climate controlled vault, the reels and the rights to their content reside solely with the University of Kentucky Libraries, a pivotal factor in the selection process. Nevertheless, there are other NDNP rules for content selection:
- Be aware of your content's date range
- This is a lesser concern today than in past award cycles. All the same, the maximum time span for NDNP content is 1836-1922. Microfilm generally does not follow such protocol and, therefore, may include restricted issues on reels containing acceptable content.
- The University of Kentucky Libraries has created an excellent microfilm database of its holdings. KY-NDNP can easily identify title reels in any date range using the database. Other micropublishers or holders of microfilm likely have a similar database. Use them.
- NOTE: We've heard of an instance when a micropublisher's holdings list suggested X number of titles, yet their internal database listed far fewer. What happened? The micropublisher spliced all of the smaller 100' reels together to make 1000' reels, then listed only the first title on each 1000' reel.
- Do your homework and be diligent with what you know!
- Choose a title that is as complete as possible (orphan titles notwithstanding)
- This is very important when it comes to meeting the required 100,000 pages. You could choose papers with few issues but you'll have to choose more titles in order to meet that page count target. More to the point: each unique title requires an essay, and writing those can be a full-time job in itself. Plan accordingly.
- NDNP Guidelines allow a single essay for title families. For instance, KY-NDNP wrote a single essay for the Paducah Sun, Paducah Daily Sun, Paducah Evening Sun, Paducah Evening Sun Weekly Edition, and the Sunday Chat - all of which are included in Chronicling America.
- Find the paper of record for the community/region it serves
- NDNP suggests you first choose your state's newspaper of record. This wasn't possible for KY-NDNP. We did not own the microfilm. The micropublisher that did was unwilling to grant access to, or even sell copies of, those materials. Our only plausible alternative was to choose several titles based on their regional prowess or orphan status (NDNP loves orphaned titles almost as much as a newspaper of record). But this meant choosing 37 titles instead of just one. (SEE: previous bold point)
- NOTE: Kentucky's newspaper of record - Louisville's Courier-Journal - is now available in the Kentucky Digital Library in keyword searchable full color.
- You may find yourself in a similar situation with a micropublisher. If so, your state's regional and/or cultural disposition could decide how you start your digitization and/or NDNP program.
- Other NDNP awardees have had both failed and successful ventures with micropublishers. We encourage you to tap them for their experience, particularly Library of Virginia, University of North Texas, or Washington State Library.
- You, and/or your advisory board, may well be faced with factors in title selection not listed here.
- Some pre-standards microfilm, i.e. pre-USNP, cannot sustain quality digital imaging
- KY-NDNP is at a real advantage in that we have an in-house microfilm and duplication facility with the expertise to manipulate poor quality master negatives into print master negatives suitable for scanning.
- For those without in-house facilities, there's reason for hope. With our "pre-standards" film we've found that:
- 99% of our "pre-standards" microfilm is adequate for digitization
- Film might lack technical targets to measure resolution, but it's focused and perfectly legible
- Reduction ratios for IA and IIB film are usually 20x or less which means you can achieve at least 300dpi scans in most cases
- Though lighting is often varied throughout a reel, it is usually within acceptable density ranges
- Though densities may vary from reel to reel, or even within a reel, imperfect densities can still produce legible digital scans.
- A good duplicating technician can usually improve under-exposed or over-exposed master negative quality through film stock alone i.e. using high or low contrast films for the duplicate. Talk to your microfilm manufacturer at length about what you need and reference the RLG Preservation Microfilming Handbook when necessary.
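The date-range and page-count rules earlier in this list lend themselves to a quick tally before any reel is pulled. The following is only a sketch: the 1836-1922 range and the 100,000-page target come from the text, while the `Reel` record, its fields, and the sample holdings are hypothetical stand-ins for whatever your microfilm database exports.

```python
from dataclasses import dataclass

NDNP_FIRST_YEAR = 1836   # date range stated in the text
NDNP_LAST_YEAR = 1922
PAGE_TARGET = 100_000    # required page count stated in the text


@dataclass
class Reel:
    """Hypothetical reel record; field names are assumptions."""
    title: str
    first_year: int
    last_year: int
    pages: int


def in_range(reel):
    """True when every issue on the reel falls inside the NDNP date range."""
    return NDNP_FIRST_YEAR <= reel.first_year and reel.last_year <= NDNP_LAST_YEAR


def summarize(reels):
    """Return (eligible pages, distinct eligible titles, reels needing review)."""
    eligible = [r for r in reels if in_range(r)]
    review = [r for r in reels if not in_range(r)]
    pages = sum(r.pages for r in eligible)
    titles = len({r.title for r in eligible})
    return pages, titles, review


# Hypothetical holdings, for illustration only.
reels = [
    Reel("Title A", 1900, 1905, 1200),
    Reel("Title A", 1906, 1910, 1100),
    Reel("Title B", 1920, 1924, 900),   # runs past 1922, so it needs review
]
pages, titles, review = summarize(reels)
print("eligible pages:", pages)
print("pages still needed:", PAGE_TARGET - pages)
print("distinct titles:", titles)
print("reels to review:", [r.title for r in review])
```

The distinct-title count is only an upper bound on the essays to write, since title families may share a single essay.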
Fortunately, Digital Library Services (DLS) provides KY-NDNP with the microfilm readers, densitometers, and microscopes necessary for the inspection of master and print master negatives, in addition to a full-service darkroom to create the highest quality second-generation negatives possible.
- If you are interested in doing the technical inspection of your master and/or print master microfilm but don't have the expertise or a full-service microfilm facility, take heart. Several NDNP awardees are in your situation and do the inspections themselves.
- If you want to do it yourself, we recommend an X-Rite 361T densitometer and Peak Shop Micro (1DIV1)= 0.0005 microscope to get you started.
KY-NDNP's production needs have changed a great deal since we were the only in-house production awardee in 2005. Back then, to produce the NDNP data we licensed the iArchives suite of applications with our internal production infrastructure. From that experience we developed with iArchives the Hosted/Hybrid production model in use today.
Hosted/Hybrid allows us to manage every aspect of production just as before (SEE: Workflow) while applying our expertise only where it is most needed. The work is still achieved using the iArchives suite of applications, but we're now employing the robust iArchives infrastructure in Utah instead of the unsuitable-for-production infrastructure in Kentucky. This relieved us of the heavy technological and administrative maintenance (inherent in managing an army of student employees) so that we now make higher quality data output.
We scan all of the microfilm here in the Digital Library Services lab using a NextScan Eclipse 300 microfilm scanner. The device is capable of creating uninterpolated images at up to 400dpi, saved as TIFF 6.0 files as required by NDNP specifications. It utilizes the latest ribbon scanning technology for faster, more accurate processing.
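A dpi figure for a newspaper scan describes the original page, not the tiny film frame, so it depends on the reduction ratio at which the page was filmed. The arithmetic below is a minimal sketch, assuming the usual relationship that resolution relative to the original equals the sampling rate at the film divided by the reduction ratio. The function names are ours and not part of any scanner software.

```python
def film_plane_dpi(target_dpi, reduction_ratio):
    """Sampling rate needed at the film to reach target_dpi on the original page."""
    return target_dpi * reduction_ratio


def original_dpi(film_dpi, reduction_ratio):
    """Resolution relative to the original page for a given sampling rate at the film."""
    return film_dpi / reduction_ratio


# A page filmed at 20x needs 20 times the sampling rate at the film
# that the same page would need if scanned at full size.
needed = film_plane_dpi(300, 20)
achieved = original_dpi(needed, 16)
print(needed, achieved)
```

The same sampling rate at the film yields a higher dpi on lower-reduction film, which is why reduction ratios of 20x or less are good news for pre-standards reels.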
Once a reel has been scanned, we perform a quality control review of the files, checking that all pages on the microfilm are digitally present and free of corruption.
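Part of such a review can be automated. The sketch below is not our actual QC tool; it assumes one folder of TIFF files per reel and an expected frame count taken from the microfilm evaluation, and it only confirms that each file begins with a valid TIFF byte-order header. It does not decode the image, so a human review is still needed.

```python
import os
import tempfile

# The first four bytes of a TIFF file: little-endian or big-endian header.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def check_reel(folder, expected_frames):
    """Return a list of problems found in one reel's folder of scans."""
    problems = []
    names = sorted(
        n for n in os.listdir(folder) if n.lower().endswith((".tif", ".tiff"))
    )
    if len(names) != expected_frames:
        problems.append(f"expected {expected_frames} files, found {len(names)}")
    for name in names:
        with open(os.path.join(folder, name), "rb") as fh:
            if fh.read(4) not in TIFF_MAGIC:
                problems.append(f"{name}: header is not TIFF")
    return problems


# Demonstration with throwaway files: one plausible header, one damaged file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "0001.tif"), "wb") as fh:
        fh.write(b"II*\x00" + b"\x00" * 8)
    with open(os.path.join(folder, "0002.tif"), "wb") as fh:
        fh.write(b"garbage!")
    problems = check_reel(folder, expected_frames=3)

for problem in problems:
    print(problem)
```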
We then transfer the files to three key locations: one copy is made to an off-site spinning disc storage array server; one copy goes to an off-site automated tape service made free to us through the University; and a third copy is made to a Western Digital My Book Studio Edition II external hard drive, which is then shipped to our processing partner, iArchives.
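Whenever files are copied to several destinations, it is worth confirming that each copy is identical to the source. The following is a sketch of one common approach, comparing SHA-256 checksums; it is an illustration, not a description of the tools we use, and the file names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large TIFFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source, copies):
    """Map each copy's path to True when its checksum equals the source's."""
    want = sha256_of(source)
    return {path: sha256_of(path) == want for path in copies}


# Demonstration with throwaway files: one faithful copy, one damaged copy.
with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "source.tif")
    copy_a = os.path.join(folder, "copy_a.tif")
    copy_b = os.path.join(folder, "copy_b.tif")
    for path, data in ((source, b"scan"), (copy_a, b"scan"), (copy_b, b"scam")):
        with open(path, "wb") as fh:
            fh.write(data)
    result = copies_match(source, [copy_a, copy_b])
    summary = {os.path.basename(p): ok for p, ok in result.items()}

print(summary)
```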
Once the raw scans are ingested into the iArchives system, we monitor the workflow, perform image quality control after crop, deskew, and page splitting steps, and apply all page-level metadata. (SEE: Workflow) From there, the system automatically produces the data sets that are then compiled into a batch by our project manager counterparts at iArchives and shipped to us here in Kentucky.
Once the batch data arrives in Kentucky, we quality control the entire package before shipping to the Library of Congress. (SEE: Deliverables) If there is a problem, we can generally fix it in the METS XML. Having a programmer on staff means we can also fix problems in all image headers. Only problems that span a mass of data require reprocessing in Utah. This almost never happens, but when it does, the problem is usually contained within a reel, which can easily be FTP'd to us after reprocessing, then slipped into the batch.
DLS uses a combination of Windows and Mac OSX computers. The Windows machines are Dell Intel Dual Core desktops with 2G of RAM running Windows XP Pro with Service Pack 3. These machines do the lion's share of our work, from the scanner to the iArchives applications to the Library of Congress' Digital Viewer and Validator (DVV).
All of our source document imaging (color), quality control, and media production uses Mac desktop and laptop machines running OSX with 4G of RAM.
To be clear, we do not endorse Dell or Apple computers. Neither is required for any aspect of the newspaper digitization program. However, to date, the DVV will not run on a Mac platform, though there is speculation that it will function with Linux. We cannot confirm this through our testing.
The first thing that should be said about infrastructure is this: whether you're doing the work in-house or outsourcing, you need ample infrastructure. Our general rule is to calculate what you need, then double it! It doesn't matter how much it is, rest assured you'll need twice as much of everything when it's all said and done.
KY-NDNP uses two storage/back-up solutions. As previously mentioned, we have a 20TB (and growing) off-site storage array server. This server does no production work and is purely for storage. We also use an off-site, robotic tape storage system to house the unprocessed (raw) images and all deliverable data sent to the Library of Congress. The University offers this free tape service for campus-wide projects like KY-NDNP that require secure mass storage.
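The "calculate what you need, then double it" rule is easy to put into numbers. The sketch below estimates raw storage for uncompressed 8-bit grayscale scans; the page dimensions are an assumption chosen for illustration, and the estimate ignores file headers and all derivative files, so treat the result as a floor rather than a budget.

```python
def uncompressed_image_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Approximate size of one uncompressed scan, ignoring file headers."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def storage_plan(pages, bytes_per_page, copies=1, safety_factor=2):
    """Apply the 'calculate what you need, then double it' rule."""
    return pages * bytes_per_page * copies * safety_factor


# Assumed page size of 17 x 23 inches, scanned at 400dpi (illustrative only).
per_page = uncompressed_image_bytes(17, 23, 400)
total = storage_plan(100_000, per_page)
print(f"{per_page / 1e6:.2f} MB per page")
print(f"{total / 1e12:.2f} TB for 100,000 pages, doubled")
```

Set `copies` to the number of locations you keep if you want the total across all of them.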
In addition to Chronicling America, the University of Kentucky Libraries, under contract with the Kentucky Virtual Library (KYVL) and the Kentucky Council on Post-Secondary Education (CPE), has developed, managed and coordinated the Kentucky Digital Library (KDL) since 1999. This program provides digital access to resources that document the history and heritage of Kentucky. The KDL provides local access to and preservation of not only Kentucky's historic newspapers created as part of NDNP but also the many titles that are not available through Chronicling America. (SEE: Digitized Newspapers) The historic newspapers are the largest of the KDL collections and overwhelmingly the most accessed, bar none. KDL is a Blacklight interface repository that employs micro-services for the curatorial functionality.
- If you are not interested in doing the technical inspection yourself, you can request your microfilm duplicating company provide density readings from the print master negative (as well as the master). Most companies charge handsomely for the service and it may ultimately be cheaper to do it yourself.
- FACT: A microfilm duplicating technician must take density readings from each master negative reel to properly duplicate a print master negative. How many readings depends on the technician, but some duplicating companies charge extraordinary amounts to provide density readings even though they are a natural part of the process.
Microfilm evaluation is one of the most critical steps in the microfilm-to-digital process. It can disqualify a reel for digitization, but mostly, it prevents mistakes in digital processing. There are two key components in evaluating microfilm:
- technical inspection of the microfilm (this usually happens after title/reel selection)
- intellectual analysis of the content (this is usually taken from the print master)
The technical film inspection is taken from the master negative reel. This includes...
- taking resolution and density readings
- inspecting the film for defects (redox, silvering, etc.)
- replacing splices as needed
Resolution readings are only possible when a resolution technical target is included on the film. Much of the "pre-standards" film doesn't include this technical target and, without it, resolution can only be judged by opinion; it's either in focus or it's not, with varying degrees in between.
- On some older films, microfilm vendors may append target sets as a matter of ownership claims. These targets sometimes include a resolution target. Do not rely on non-native targets for resolution or creation provenance.
- Short of refilming the content, nothing can be done to improve resolution of a piece of microfilm once it has been made.
- Microfilm will suffer some resolution loss with each duplicative generation. In other words, if the resolution is bad on the master negative, it will be slightly worse on the print master or positive.
- A reel that doesn't resolve well can be eliminated from digitization... or not. It depends on what you're willing to accept.
Like reels with imperfect resolutions, reels with less-than-stellar density readings can also be discarded or they can be successfully digitized. Why does the density matter? Because a reel with good density will produce better, crisper, more legible digital images. (SEE: Infrastructure) Should you throw out a reel, or a whole title, because of bad density readings? KY-NDNP certainly didn't. Consider the ability to manipulate density during the duplication process for better results. There are three ways to change or improve the density of a print master:
- The duplicating technician can change the lamp and speed settings of a duplicating machine to increase or decrease exposure from the master negative onto the print master
- The type of microfilm can play an important role in improving the print master densities by choosing high-contrast or low-contrast microfilm stock
- Where you take the density readings throughout the reel can have a great effect on the "average". The average density helps determine the duplicator settings and that can make all the difference in good print masters for scanning. It has been said that a technician can find any density reading on a piece of film if they're willing to look for it. It's true.
Digital Library Services (DLS) replaces inferior splices because they can fail during the duplication process and rip the film, and because they create a sort of "speed bump" during duplication causing blur of the print master. Replacing older glue, weld, or ultrasonic splices with flatter, more secure, less caustic tape splices allows problem-free duplication and, ultimately, improves the preservation of the master negative, not to mention the legibility of countless digital page images taken from print masters.
The basics of intellectual evaluation are rather simple. While reviewing a reel of microfilm in a reader, the newspaper issues and pages, their order, and missing content are recorded. Noted, too, are anomalies like supplements or additional titles that could affect the digital output. Keeping track of such facts is important:
- to ensure that the microfilm scanner has captured everything present on the microfilm reel
- to consistently express in the metadata what is on the microfilm for accurate digital output
To keep up with our evaluation information, KY-NDNP developed the Newspaper Evaluation Database (NED). NED is a MySQL database coupled with a powerful web interface allowing seamless entry and recall of select newspaper information from anywhere on the globe with a connection. It was first developed in 2006 for KY-NDNP use only, and it was specific to newspapers on microfilm. NED has since expanded, now offering functionality for other archival materials, and development is a constant. NED is now available to other NDNP awardees. (SEE: Metadata)
What KY-NDNP records in NED may be different than what you want to keep. Every microfilm-to-digital newspaper operation is different. What you collect, or don't, are choices you have to make. Our only word of caution is that there is a point of diminishing returns when collecting intellectual information.
A final point to be made about evaluating microfilm lies in understanding microfilmers and how newspapers are filmed. Certain patterns can be observed during the evaluation process that will help answer questions. There are patterns of technique indigenous to the microfilmer (this is especially true in pre-standards film), patterns within the newspapers (ads, pagination, etc.), and even patterns in the way newspapers have been bound. Recognizing patterns can be to one's advantage. The key, like most everything else in newspaper digitization, is attention to detail.
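The point made in this section about where density readings are taken is easy to demonstrate. The sketch below uses made-up densitometer readings along a single reel and shows how the "average" shifts depending on which part of the reel is sampled. The numbers are hypothetical and are not acceptance thresholds.

```python
from statistics import mean


def average_density(readings):
    """Average a set of density readings, rounded to two decimal places."""
    return round(mean(readings), 2)


# Hypothetical readings taken from the start to the end of one reel.
all_readings = [0.85, 0.90, 1.10, 1.25, 1.30, 1.40]
start_only = all_readings[:3]   # sampling only the first part of the reel
end_only = all_readings[3:]     # sampling only the last part of the reel

print("whole reel:", average_density(all_readings))
print("start only:", average_density(start_only))
print("end only:  ", average_density(end_only))
```

Because the average feeds the duplicator settings, readings spread along the whole reel give the technician a fairer picture than a handful taken from one end.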
Metadata can be the scariest part of the NDNP process but it doesn't have to be. It's just a matter of getting used to the rules and looking at what can feel like endless lines of text and tags. "Don't Panic!" It all boils down to data about data.
NDNP follows the Library of Congress METS (Metadata Encoding and Transmission Standard) schema. The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium... The following are examples of KY-NDNP data a la METS. Samples from vendors may look slightly different but the basic premise is the same.
Note the use of XML in the above examples. The major advantage of using XML, not only in NDNP, is that it's usually very intuitive - <title>Kentucke Gazette</title>; validators catch many errors (such is the case with the Library of Congress DVV); and there are a lot of tools built with it, such as EAD, TEI, or XHTML. Below are examples of descriptive, administrative, and structural XML metadata in NDNP.
For KY-NDNP, metadata collection starts with the Microfilm Evaluation. It is here that we gather, in one place, all of the information we will need about the title, the reel, and the newspaper (descriptive, administrative, and some structural). For our in-house production, we imported this metadata from NED into the processing system. In our Hosted/Hybrid model, we export the metadata from NED as a .csv file, then simply upload that file using the iArchives online "Dashboard" portal. The metadata is then distributed to the necessary image headers and XML components during processing. If you're outsourcing to a vendor other than iArchives or you aren't using NED, your vendor will ask that you enter this same information into a spreadsheet (probably Excel) that they can then import into their system to produce the same deliverables.
Metadata can be found in a variety of places, not just from the evaluation of the newspaper. It can come from the MARC record, the microfilm box, targets on the film, or the film itself (such as the film stock and manufacturer). We've even managed to find metadata in decades-old filming log books. For example, "date filmed" 6/65 may be transposed into the reel metadata as:
<ndnp:dateMicrofilmCreated>June 1965</ndnp:dateMicrofilmCreated>
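For those building their own spreadsheet or export, a .csv file of evaluation metadata can be produced with a few lines of code. This is a sketch only: the column names and the sample row are hypothetical and do not reflect the NED schema or any vendor's import format, so ask your vendor which fields they expect.

```python
import csv
import io

# Hypothetical column names, for illustration only.
FIELDS = ["lccn", "title", "reel_barcode", "issue_date", "edition", "page_count"]


def export_csv(rows):
    """Serialize evaluation rows (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up row; every value here is a placeholder.
rows = [
    {
        "lccn": "sn00000000",
        "title": "Example Gazette",
        "reel_barcode": "00000000001",
        "issue_date": "1905-01-01",
        "edition": 1,
        "page_count": 8,
    }
]
csv_text = export_csv(rows)
print(csv_text)
```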
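A transposition like this one can be scripted when a log book has been keyed into a spreadsheet. The sketch below assumes log entries in month/two-digit-year form and assumes they all fall in the 1900s; both assumptions should be checked against your own film. The element name is the one shown in the example, and the helper names are ours.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def filmed_date_to_text(value, century=1900):
    """Turn a log entry such as '6/65' into text such as 'June 1965'.

    Two-digit years are assumed to belong to `century`.
    """
    month_text, year_text = value.strip().split("/")
    month = int(month_text)
    year = int(year_text)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {value!r}")
    if year < 100:
        year += century
    return f"{MONTHS[month - 1]} {year}"


def as_element(value):
    """Wrap the converted date in the reel-level element shown in the example."""
    text = filmed_date_to_text(value)
    return f"<ndnp:dateMicrofilmCreated>{text}</ndnp:dateMicrofilmCreated>"


print(as_element("6/65"))
```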
Not all metadata we collect is used in our NDNP deliverables. We collect some information purely for our own purposes, such as mutilated pages. Other information is used for our KDL deliverables, such as publisher. The bottom line is that collecting too much can stop production, while collecting too little can cost you important provenance information. (SEE: Microfilm Evaluation)
The workflow for newspaper film-to-digital production may vary slightly from vendor to vendor, and between in-house and Hosted/Hybrid production. The beginning and end procedures may also differ for each content producer. Regardless of who is doing the work, the basic digitization processes are the same for everyone. To give you a better idea of what we mean, see the workflow chart from our in-house production days, and this graphic that describes our workflow for the Hosted/Hybrid model.
During our first two award cycles we had four staff positions involved in the workflow (not including administrators): Program Manager, Office Manager, Image Management Specialist (IMS), and Student Workers (part time). Under our third award using the Hosted/Hybrid solution, the IMS and student positions were condensed to a single full-time staff position. For many NDNP awardees, this workforce is scaled down further to a Program/Project Manager/Coordinator. These programs typically outsource all of the production save for the intellectual microfilm evaluation and the data QC from their vendor.
Many ask about our Program Manager position and all that entails. This staffing graphic roughly explains both KY-NDNP production models.
Choosing the Titles and Microfilm
The KY-NDNP program team along with the Advisory Board chooses the titles to be digitized. (SEE: Selection) We pull from our climate controlled vault the reels to be duplicated for scanning. (SEE: Infrastructure) The technical evaluation of the microfilm takes place prior to the duplication. The print master (2N) densities are, of course, assessed once the 2N is made. (SEE: Microfilm Evaluation) Note that density readings from the master and print master are no longer NDNP requirements.
Evaluating and Scanning Microfilm
The intellectual evaluation of the newspaper, sometimes known as collation, takes considerably more time than technical inspection, but it's extremely important to do before scanning. Scanning technicians don't always have the luxury of working with modern, standards-made film. The evaluation results will help them first understand what to expect during a scanning session with regard to technical difficulties and content output. For example, are pages straight on the camera bed? If not, this could cause skew problems. Are pages evenly spaced or are there foreign objects in frames? These factors can create scanner detection mistakes. All of these issues and more can impact production and costs.
Ingest, Crop, Deskew, and Split
The scanner's image files, or raw scans, must be processed for aesthetic correction, metadata conjugation, and derivative generation. To do this, raw scan files are ingested into a digitization system typically made of multiple conjoined applications. For most vendors, the majority of the applications rely on automated functionality. But not all. For example, an automated system may first try to crop, split, and deskew page images. To verify the system has performed such tasks properly, a human must look at each image. From there, most systems allow for the manual manipulation of the images to create the desired end product. For instance, 2-up images (two pages filmed side by side in the same exposure) can be split so each page becomes a digital object; badly skewed images can be further corrected for straighter columns (a factor in OCR output); and over-cropped and under-cropped page images can be modified for a more desirable aesthetic. These three steps must be performed first in the digitization processing to avoid processing repercussions later.
When the images have passed Quality Assurance, sometimes called QA or QC (Quality Control), they're ready to be coupled with the appropriate metadata. Here, page numbers, edition, section, and issue dates are applied. This metadata will be applied to all derivative files. To our knowledge, this information, regardless of vendor, must be entered by humans (humans who are following the microfilm's intellectual evaluation). All major newspaper digitization vendors have a system by which their clients input/upload the metadata that their systems can then extract. (SEE: Metadata) NOTE: Metadata associated with a reel is typically applied automatically as part of post processing derivative file generation.
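To make the 2-up split concrete, here is a toy sketch. It treats an image as a plain list of pixel rows and cuts it at the midpoint, or at a gutter position supplied by an operator. Production systems work on real image files with their own gutter detection; nothing here reflects how any vendor's software does it.

```python
def split_two_up(image, gutter=None):
    """Split a 2-up frame into left and right page images.

    `image` is a list of pixel rows. `gutter` is the column index of the
    cut; when omitted, the frame is cut at its midpoint (an assumption
    that only holds when both pages are the same width and centered).
    """
    width = len(image[0])
    cut = width // 2 if gutter is None else gutter
    left = [row[:cut] for row in image]
    right = [row[cut:] for row in image]
    return left, right


# A tiny 2 x 4 "frame" standing in for a scanned exposure.
frame = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
left_page, right_page = split_two_up(frame)
print(left_page, right_page)
```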
Zoning for OCR
The next step is to build the columnar boxes, or "zones", to establish read order. This once manual step is now automated by most vendors. Though not accurate 100% of the time, the cost savings make it an alluring alternative to vendor and client alike, and development continues to improve its accuracy. Once this step is complete, the OCR (ALTO for NDNP) can be generated, which is itself an automated feature.
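As a rough illustration of what "read order" means, the sketch below sorts zone boxes into columns from left to right, then top to bottom within each column. It is a deliberately simplified stand-in, not any vendor's zoning algorithm: it ignores headlines that span columns, and the tolerance value is an arbitrary assumption.

```python
def read_order(zones, column_tolerance=20):
    """Order zone boxes column by column, top to bottom.

    Each zone is a tuple (x, y, width, height) in pixels. Zones whose left
    edges lie within `column_tolerance` pixels of a column's first zone are
    treated as belonging to that column.
    """
    columns = []  # each entry: (left edge of first zone, list of zones)
    for zone in sorted(zones, key=lambda z: z[0]):
        for column_x, members in columns:
            if abs(column_x - zone[0]) <= column_tolerance:
                members.append(zone)
                break
        else:
            columns.append((zone[0], [zone]))
    ordered = []
    for _, members in columns:
        ordered.extend(sorted(members, key=lambda z: z[1]))
    return ordered


# Four hypothetical zones on a two-column page, listed out of order.
zones = [
    (310, 50, 280, 400),
    (10, 500, 280, 300),
    (12, 40, 280, 440),
    (305, 470, 280, 330),
]
ordered_zones = read_order(zones)
for zone in ordered_zones:
    print(zone)
```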
Final QA and Metadata Import
The final Quality Assurance step looks at all facets of the digital page images: crop, deskew, split, metadata, and column zone boxes. Once any outstanding issues have been resolved, it's simply a matter of inserting the reel level metadata gathered in the microfilm evaluation form and pressing "go".
The Final Product
The images go through final automated processes that generate and assemble the data into a batch (batch, here, is defined as a set of data packages prepared for delivery to the Library of Congress per NDNP specifications). Derivative files are generated (PDF, JP2 [JPEG2000], XML) with the associative metadata, and then sorted into a tidy issue, reel, and LCCN directory structure.
One final note: every step of the KY-NDNP program is documented using our internal wiki. Not enough can be said for the organizational relief our wiki offers. We track reel and title through every step of the microfilm-to-digital process.
All NDNP batches are delivered according to a predefined order.
- Each batch is assigned a uniform name (of sorts) such as batch_ky_20060803_nirvana; where ky signifies the awardee; 20060803 is the date of batch validation; and nirvana (N) is the fourteenth batch of the respective program phase
- Each LCCN has its own directory and...
- Inside each LCCN directory are the reels - or titles from a reel - associated with the LCCN. The reels are identified by the bar code that is provided by the Library of Congress and assigned by the KY-NDNP Program Manager
- The titles are then broken down into issue containers/directories with the reel's targets placed within the reel container itself
- a typical directory tree
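The batch naming convention described in this list can be unpacked in code. The pattern below is inferred from the single example batch_ky_20060803_nirvana and is an assumption, not the official NDNP specification. The ordinal calculation likewise assumes that batch names within a phase run alphabetically, one batch per letter.

```python
import re
from datetime import datetime

# Pattern inferred from the one example in the text; treat it as an assumption.
BATCH_RE = re.compile(r"^batch_([a-z]+)_(\d{8})_([a-z]+)$")


def parse_batch_name(name):
    """Split a batch name into awardee, validation date, label, and ordinal."""
    match = BATCH_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized batch name: {name!r}")
    awardee, date_text, label = match.groups()
    return {
        "awardee": awardee,
        "validated": datetime.strptime(date_text, "%Y%m%d").date(),
        "label": label,
        # Position of the label's first letter in the alphabet (a = 1).
        "ordinal": ord(label[0]) - ord("a") + 1,
    }


info = parse_batch_name("batch_ky_20060803_nirvana")
print(info)
```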
We send our batch data to the Library of Congress on external hard drives. They keep those drives for 3-6 months. Meanwhile, we continue to produce data for delivery. Needless to say, we have accumulated a lot of hard drives over the years! To keep up with them, and to have a bit of fun, they're each named for a thoroughbred racehorse (this is Kentucky after all - Horse Capital of the World). We learned the hard way that some hard drive brands couldn't withstand the rigors of travel and constant access. Today, we use only Western Digital drives - 1-2TB - from the MyBook Studio and Mirror series. Some drives are configured as RAID and some are not.
It's not just the drives we have to keep up with but the batches, too. Just like the external hard drives, we come up with a naming scheme for each batch and a theme for those names with each award cycle. The name we give a batch stays with it for its lifetime. The data, then, can be traced back to a batch and the drive it was delivered on. The KY-NDNP Phase I data was delivered in 14 batches named for musicians: Abba - Nirvana (A-N). Phase II delivered 9 batch drives named for one-word movies: Airplane - Ishtar. Phase 3 batches ran from the very real metal, Aluminum, to the comic-book creation, Kryptonite. Have a look at our funny lists if you dare!
Again, we keep good documentation of every aspect of the KY-NDNP program through our internal wiki.
Wondering how to calculate line pairs, capture resolution, or True DPI? How about determining reel generation or estimate pages per reel? The KY-NDNP Quick Guide can help!
© 2014 University of Kentucky Libraries