Text Encoding

Text Conversion

A computer cannot directly search for text in a scanned image. The text must first be extracted and stored in a separate character-based file. This process, called OCR (optical character recognition), works quite well on documents with consistently shaped letters, such as typed or typeset documents, but it works poorly on handwritten materials. Consequently, we had to have humans transcribe the letters and words manually into text. For our first project we outsourced the transcription to a commercial firm called "Access Imagery"; since then, student workers have done it.

Text Encoding

Transcribing a document results in a file containing all the words of a document, but in order to preserve the visual layout of the original, special encoding (called "markup" or "tagging") of the text is required. In a properly encoded text document, the layout will mimic the paragraph structure, line breaks, and pagination of the original document.

We chose to have the transcriptions of the materials in the Civil War collection encoded using the TEI encoding scheme (see sidebar) because it is a widely used means of preserving the layout of the original document and of tagging manuscript materials semantically and syntactically. Because we encoded the names of people and places that appear in these transcriptions, a user can generate lists of those names from a single document or across multiple documents.
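As a rough illustration of how such name lists can be produced from tagged transcriptions, here is a minimal Python sketch using only the standard library. The sample text, the names in it, and the helper functions `list_names` and `list_names_across` are hypothetical and are not taken from this project's files. The sketch also assumes the tags carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample transcription; the names are invented for illustration.
SAMPLE = """<body>
  <p>I met <persName>John Doe</persName> near
  <placeName>Exampleville</placeName> and wrote to
  <persName>Jane Doe</persName> and <persName>John Doe</persName>.</p>
</body>"""


def list_names(xml_text, tag):
    """Return the sorted, de-duplicated text of every <tag> element in one document."""
    root = ET.fromstring(xml_text)
    # itertext() gathers text from nested tags such as <surname> inside <persName>;
    # split/join collapses line breaks and runs of spaces.
    names = {" ".join("".join(el.itertext()).split()) for el in root.iter(tag)}
    return sorted(names)


def list_names_across(documents, tag):
    """Merge the name lists of several documents into one sorted list."""
    names = set()
    for doc in documents:
        names.update(list_names(doc, tag))
    return sorted(names)


people = list_names(SAMPLE, "persName")
places = list_names_across([SAMPLE, SAMPLE], "placeName")
```

The same function works for any of the name tags listed later on this page (for example `orgName` or `geogName`), since it only depends on the tag name.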

We also tagged any words that were misspelled or undecipherable so the reader would not think a misspelling was the fault of the transcriber. We could also have tagged other details mentioned in the manuscripts such as all military terms, domestic terms, or obsolete vocabulary, but we did not expect our primary users to be requesting lists of such terminology.

The Text Encoding Initiative (TEI)

TEI is an encoding scheme used in the scholarly, archival and cultural heritage communities for marking up textual information. TEI tags describe the structural hierarchies, divisions, characteristics, and content of a given electronic document.

TEI Tags Used in this Project

A TEI-conformant text contains a "header" section and a "body" section.

The TEI Header

The TEI header (<teiHeader>) can hold descriptive metadata about the document, much as Dublin Core does, along with other kinds of metadata. In this project, however, we chose to put the full descriptive metadata in a separate XML file in unqualified Dublin Core, which points to the TEI file containing the encoded transcription of the document. The TEI header therefore contains only a minimal set of descriptive metadata elements.

Example of a <teiHeader> in this Project

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Letter of George W. Pearl dated 1862-09-07</title>
      <author><persName><surname>Pearl</surname>, <foreName>George W.</foreName></persName></author>
      <respStmt><resp>Transcribed and tagged by </resp><name><orgName>Access Imagery, Inc.</orgName></name></respStmt>
    </titleStmt>
    <publicationStmt>
      <publisher><orgName>Hamilton College Library</orgName></publisher>
    </publicationStmt>
    <sourceDesc>
      <p>George W. Pearl Letters Collection, #04000</p>
      <p>Transcribed from digital images of the original manuscripts.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
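To show how the few metadata elements in such a header can be read by a program, the following Python sketch pulls out the title, author, and publisher. It uses a shortened copy of the header above; the function name `header_summary` is hypothetical, and the element paths assume the header is laid out exactly as in this example, with no XML namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example header above.
HEADER = """<teiHeader><fileDesc>
  <titleStmt>
    <title>Letter of George W. Pearl dated 1862-09-07</title>
    <author><persName><surname>Pearl</surname>, <foreName>George W.</foreName></persName></author>
  </titleStmt>
  <publicationStmt>
    <publisher><orgName>Hamilton College Library</orgName></publisher>
  </publicationStmt>
</fileDesc></teiHeader>"""


def header_summary(xml_text):
    """Return title, author, and publisher from a <teiHeader> shaped like the example."""
    root = ET.fromstring(xml_text)  # root is the <teiHeader> element
    name = "fileDesc/titleStmt/author/persName/"
    forename = root.findtext(name + "foreName", default="")
    surname = root.findtext(name + "surname", default="")
    return {
        "title": root.findtext("fileDesc/titleStmt/title", default=""),
        "author": f"{forename} {surname}".strip(),
        "publisher": root.findtext("fileDesc/publicationStmt/publisher/orgName", default=""),
    }


summary = header_summary(HEADER)
```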

The TEI Body Tag

The TEI body tag (<body>) wraps the actual text of the document and may consist of a wide range of tags that wrap or specify kinds of layout structure, semantics, or syntax.

Basic Document Structure TEI Tags

These tags are used to retain structural equivalence between the transcription and the original.

  • <div#> - division with optional number
  • <pb n="#"> - page break with optional page number
  • <p> - paragraph break
  • <lb/> - line break
  • <space> - indented space

TEI Tags Used in the <body> of Transcriptions in this Project:

  • abbr expan=, div, div1, date, gap/, name, note place=, note, p, pb, sic corr=, title, unclear, xref

Tags from the "Additional Element Set for Names and Dates" (teind2.dtd)

  • addName, country, distance, foreName, geog, geogName, offset, orgName, persName, placeName, region, roleName full="abb", roleName, settlement, surname full="abb", surname

When the text was hard to read or departed from standard spelling, the following editorial tags were used as appropriate:

  • <gap /> for any text that was illegible
  • <unclear> for text that was transcribed with some uncertainty because it was hard to read
  • A <choice> element when the text is misspelled, uses a variant spelling, or is an abbreviation:
    <choice>
        <sic>text with error</sic>
        <corr>corrected text</corr>
    </choice>
    
    <choice>
        <orig>variant spelling</orig>
        <reg>regularized spelling</reg>
    </choice>
    
    <choice>
        <abbr>abbreviation</abbr>
        <expan>expanded form</expan>
    </choice>
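The value of pairing the two readings is that software can show either one. The Python sketch below, offered only as an illustration, produces a "as written" reading or an "edited" reading from the same encoded text. The sample sentence, the function names, and the "[illegible]" placeholder for <gap /> are assumptions, not part of this project's files or stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sentence using the editorial tags described above.
SAMPLE = """<p>We <choice><sic>recieved</sic><corr>received</corr></choice> orders from
<choice><abbr>Col.</abbr><expan>Colonel</expan></choice> Doe <unclear>yesterday</unclear>
but <gap/> arrived.</p>"""

# Which child of <choice> to keep for each kind of reading.
PREFERRED = {
    "original": ("sic", "orig", "abbr"),
    "edited": ("corr", "reg", "expan"),
}


def _render(el, keep):
    """Return the text inside el (excluding its tail), resolving <choice> and <gap>."""
    if el.tag == "gap":
        return "[illegible]"  # assumed placeholder for unreadable text
    if el.tag == "choice":
        # Emit only the alternative that matches the requested reading.
        return "".join(_render(child, keep) for child in el if child.tag in keep)
    parts = [el.text or ""]
    for child in el:
        parts.append(_render(child, keep))
        parts.append(child.tail or "")
    return "".join(parts)


def reading(xml_text, mode="edited"):
    """Return the plain-text reading of a fragment; mode is 'original' or 'edited'."""
    text = _render(ET.fromstring(xml_text), PREFERRED[mode])
    return " ".join(text.split())  # collapse line breaks and extra spaces


as_written = reading(SAMPLE, "original")
as_edited = reading(SAMPLE, "edited")
```

With this sample, `as_written` keeps "recieved" and "Col." while `as_edited` substitutes "received" and "Colonel"; text inside <unclear> is kept in both readings.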
    
    

Entities and Extended Pointers

  • Links to the image file at the beginning of each page:
    <xref doc="[image_file_name]">PAGE IMAGE</xref>

View an example of a TEI file from this project (created in 2004).

Displaying TEI-Encoded Text on the Web (XSL Stylesheets)

We display the TEI-encoded texts on the Web using an XSL stylesheet, which reads each TEI tag and converts it to XHTML. The output preserves the original's layout and color-codes the names of persons, organizations, places, and geographic features.
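The project's stylesheet is not reproduced here. To convey the general idea of a tag-by-tag transformation, the following sketch does a comparable mapping in Python rather than XSL. The `TAG_MAP` table, the CSS class names, and the sample sentence are all hypothetical. A real stylesheet would handle many more tags, and the colors themselves would come from CSS rules attached to the class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI tag names to (XHTML tag, CSS class).
TAG_MAP = {
    "p": ("p", None),
    "lb": ("br", None),
    "persName": ("span", "person"),
    "orgName": ("span", "organization"),
    "placeName": ("span", "place"),
    "geogName": ("span", "geographic"),
}


def to_xhtml(el):
    """Recursively convert a TEI element into an XHTML element."""
    # Unmapped tags become <span> with the TEI tag name as the class.
    html_tag, css_class = TAG_MAP.get(el.tag, ("span", el.tag))
    out = ET.Element(html_tag)
    if css_class:
        out.set("class", css_class)
    out.text = el.text
    for child in el:
        converted = to_xhtml(child)
        converted.tail = child.tail  # keep the text that follows the child
        out.append(converted)
    return out


# Hypothetical sample paragraph; names are invented.
SAMPLE = (
    "<p>Wrote to <persName>Jane Doe</persName><lb/>"
    "from <placeName>Exampleville</placeName>.</p>"
)
html = ET.tostring(to_xhtml(ET.fromstring(SAMPLE)), encoding="unicode")
```

A style rule such as `.person { color: ... }` for each class would then supply the color-coding.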

(Reviewed: June 15, 2017)