Digitizing Textual Works to Images (D4.e)

Description

This document presents factors to consider when digitizing materials that contain text. Many of these factors also apply to materials consisting mostly of distinct lines, with or without text, such as line art, blueprints, and maps. For born-digital textual objects (that do not need digitization), see the document "Submitting Born Digital Objects."

Definitions

"Textual works" can be broken down into two categories.

Typed or Printed Text

These are works of 100% text in which the text has a distinct edge representation, with no tonal variation within the characters. This includes printed texts (from offset presses and computer printers). Other materials that sometimes have the same characteristics are music scores, fine line drawings, woodcuts, and printed maps. Color and grayscale targets are not needed when digitizing such works.

Newspapers

Images of newspaper pages often contain extraneous markings that result from blemishes in the original, such as creases, blotches, and stains. Many of these can be removed by de-speckling, which produces a background with fewer non-textual markings. If de-speckling is done before the text is run through OCR, the OCR will be more accurate and less manual manipulation will be needed to remove the blemishes.
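The following is a minimal sketch of one way de-speckling can work, offered only as an illustration. It assumes the page has already been reduced to a bitonal grid (1 for ink, 0 for background) and treats any connected group of ink pixels smaller than a chosen size as a blemish. The function name despeckle and the min_size parameter are hypothetical, and this document does not prescribe a particular algorithm; production scanning and OCR software typically offers its own de-speckling filters.

```python
from collections import deque


def despeckle(image, min_size=3):
    """Return a copy of a bitonal image with small ink specks erased.

    image:    list of rows, each a list of 0 (background) or 1 (ink).
    min_size: hypothetical threshold; connected ink regions containing
              fewer pixels than this are treated as blemishes.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]          # leave the input untouched
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill (8-connected) to collect one region of ink pixels.
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the region if it is too small to be part of a character.
            if len(region) < min_size:
                for ry, rx in region:
                    result[ry][rx] = 0
    return result


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],   # the lone pixel in column 1 is a speck
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],   # the 3-pixel vertical stroke is kept
    [0, 0, 0, 0, 0, 0],
]
cleaned = despeckle(page, min_size=3)
```

The threshold is a trade-off: set too high, it can erase punctuation and the dots on letters such as "i"; set too low, it leaves stains behind. Any value should be checked against sample pages before it is applied to a whole collection.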

Manuscript Materials

These are works in which the text (or lines) was produced by hand. They include hand-written letters and notebooks as well as blurry text, line drawings, blueprints, and hand-drawn maps, so long as they were produced manually.

View the diagram: Digitizing Printed Text - Recommended Specification.

View the diagram: Digitizing Manuscript Materials - Recommended Specification.

(Reviewed: September 27, 2008)