2. Global
Encoding Common to Every Document
2.1 Header
2.2 Unique Identifiers
2.3 Basic Document Structure
2.4 Titles and Naming
2.5 References to External Files (Page Breaks and Entity Declarations)
[Note: Above the Header of each document are the XML Declaration, Document Type Declaration, External Entity Declarations, and the open tag of the "root element," TEI.2, which contains all other elements. Please go to the Annotated Template or to section 2.5 to read more about how to insert these.]
2.1 The Header
Every XML document we create has a "header," which carries essential information about who is responsible for creating and publishing the document and about the source of the text we are marking up, and which serves as a kind of electronic title page. The header is analogous to a book's first few pages, which inform you of the author, publisher, copyright date, terms of publication, etc.
Since much of the information in the header is the same for all of the XML documents we create, we recommend that you use the template to simplify your encoding of it.
Below, you will find descriptions of the main parts of the header, and you can click here to consult an annotated version of the template.
The <teiHeader> has three principal components:
- <fileDesc> contains a full bibliographic description of an electronic file
- <profileDesc> provides a detailed description of non-bibliographic aspects of a text, specifically the situation in which it was produced, the participants, and their setting
- <revisionDesc> summarizes the revision history for a file
These elements are arranged within the <teiHeader> in this order, so the overall structure of <teiHeader> is this:
<teiHeader>
<fileDesc></fileDesc>
<profileDesc></profileDesc>
<revisionDesc></revisionDesc>
</teiHeader>
File description <fileDesc>
This should contain the following components:
- Title statement <titleStmt> includes 1) the title given to the electronic work (which here always includes the subtitle provided by us: "a machine readable transcription"); 2) the author; 3) the editors; 4) information about others responsible for aspects of the electronic text; and 5) the names of the sponsors and funders. An example in which the original document bears a title given by Whitman:
<titleStmt>
<title level="m" type="main">Song of Myself</title>
<title level="m" type="sub">a machine readable transcription</title>
<author>Walt Whitman</author>
<editor>Ed Folsom</editor>
<editor>Kenneth M. Price</editor>
<respStmt>
<resp>Transcription and encoding</resp>
<name>The Walt Whitman Archive Staff</name>
</respStmt>
<sponsor>The Institute for Advanced Technology in the Humanities</sponsor>
<sponsor>University of Iowa</sponsor>
<sponsor>University of Nebraska-Lincoln</sponsor>
<funder>The National Endowment for the Humanities</funder>
<funder>The United States Department of Education</funder>
</titleStmt>
An example for a manuscript that lacks an authorial title (to read the guidelines for assigning titles, click here.):
<titleStmt>
<title level="m" type="main" rend="bracketed">I see who you are</title>
<title level="m" type="sub">a machine readable transcription</title>
. . .
</titleStmt>
etc.
Note that the title element includes a rend attribute that indicates it has been supplied by us and should therefore be displayed with brackets.
<editionStmt>
<edition>
<date>2005</date>
</edition>
</editionStmt>
<publicationStmt>
<idno>uva.00023</idno>
<distributor>The Walt Whitman Archive</distributor>
<address>
<addrLine>The Institute for Advanced Technology in the Humanities</addrLine>
<addrLine>Alderman Library</addrLine>
<addrLine>University of Virginia</addrLine>
<addrLine>P.O. Box 400115</addrLine>
<addrLine>Charlottesville, VA 22904-4115</addrLine>
<addrLine>[email protected]</addrLine>
</address>
<availability>
Copyright © 2005 by Ed Folsom and Kenneth M. Price, all rights reserved. Items in the Archive may be shared in accordance with the Fair Use provisions of U.S. copyright law. Redistribution or republication on other terms, in any medium, requires express written consent from the editors and advance notification of the publisher, The Institute for Advanced Technology in the Humanities. Permission to reproduce the graphic images in this archive has been granted by the owners of the originals for this publication only.
</availability>
</publicationStmt>
<sourceDesc>
<bibl>
<author>Walt Whitman</author>
<title>Calamus Leaves</title>
<orgName>Yale Collection of American Literature, Beinecke Rare Book and Manuscript Library</orgName>
<note type="project">Transcribed from our own digital image of original manuscript.</note>
</bibl>
</sourceDesc>
Note on <orgName>: The institution that holds the manuscript should be cited as listed in the Preferred Citation table in the References section of the Encoding Guidelines.
Notes on description of source: This information is about the copy text, and the <title> here (as opposed to the one in titleStmt) should be given exactly as it appears in the records of the institutional repository, no matter how imprecise or wrong-headed their conventions may seem. Many times, the most specific title for the material will be that given to the folder used to store it, since few archives assign a title to each individual item; often, therefore, the <title> given in the <sourceDesc> will be a folder label.
At present, we almost always work from our own digital images, but we have also worked from Joel Myerson's facsimile reproductions of Whitman manuscripts (published in Joel Myerson, The Walt Whitman Archive: A Facsimile of the Poet's Manuscripts, New York: Garland, 1993); from the Primary Source Media Whitman CD (Major Authors on CD-ROM: Walt Whitman, Eds. Ed Folsom and Kenneth M. Price, Woodbridge, CT: Primary Source Media, 1997); or from the original manuscripts themselves. Whatever the case, specific information about the image(s) and/or text(s) you rely on should be given in a <note>. If you consult more than one source, list each, separated by semicolons. (Please note that when citing Myerson, the volume #, part #, and page # change from manuscript to manuscript.)
By the way, we say "our own digital image" rather than, say, the Whitman Archive's digital image so as to draw a clear distinction with Myerson's volumes, also called—somewhat confusingly—the Whitman Archive.
In the <profileDesc> is a list of all hands other than Whitman's that the markup declares as being in any way responsible, typically as the value of a "resp" (or "responsibility") attribute in a note, unclear, or gap element.
For example, if you are transcribing a Whitman manuscript that has a note by Fredson Bowers written physically on it, the header must have a <profileDesc> that reads:
<profileDesc>
<handList>
<hand scribe="Fredson Bowers" id="fb"/>
</handList>
</profileDesc>
(For more on this topic and how to encode non-Whitman writing on manuscripts, see section 3.10, "Writing in Others' Hands".)
You also need to include a <handList> in the <profileDesc> if your markup includes any <unclear> or <gap> elements, which require a "resp" attribute. For example, if Andy Jewell is encoding a manuscript with an unclear word and inserts this markup:
<unclear reason="cut away" cert="60%" resp="awj">herbage</unclear>
the document's <teiHeader> will need to include this <profileDesc>:
<profileDesc>
<handList>
<hand scribe="Andrew Jewell" id="awj"/>
</handList>
</profileDesc>
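The two cases above can occur in the same manuscript: a note in another person's hand and a reading supplied by the encoder. A minimal sketch of the combined header entry, assuming both <hand> declarations simply sit side by side in one <handList> and reusing the scribe names and ids from the examples above:

```xml
<!-- markup is simplified; assumes one <hand> per responsible person -->
<profileDesc>
  <handList>
    <!-- hand that wrote a note on the manuscript itself -->
    <hand scribe="Fredson Bowers" id="fb"/>
    <!-- encoder cited in the resp attribute of <unclear> or <gap> -->
    <hand scribe="Andrew Jewell" id="awj"/>
  </handList>
</profileDesc>
```

Each id declared here should match a "resp" value used in the body of the document.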
The revisionDesc element is used to summarize the changes that have been made to the file. It contains date, respStmt, name, and item elements to specify the date, responsible individuals, and changes. IMPORTANT: TEI allows only one <item> per <change>. If changes are performed at the same time, insert additional changes within the same <item> and use semicolons. If multiple changes are performed at different times, add another <change> at the top, so that changes are listed in reverse chronological order (most recent change first). To describe the tasks in our routine workflow, choose from the following terms for the content of <item>:
If the task is something other than these, any descriptive phrase can be used. Example:
<revisionDesc>
<change>
<date>2002-10-30</date>
<respStmt>
<name>Brett Barney</name>
</respStmt>
<item>Converted to camel case</item>
</change>
<change>
<date>2002-09-14</date>
<respStmt>
<name>Kenneth M. Price</name>
</respStmt>
<item>Edited</item>
</change>
<change>
<date>2002-09-07</date>
<respStmt>
<name>Andrew Jewell</name>
</respStmt>
<item>Checked; revised</item>
</change>
<change>
<date>2000-08-22</date>
<respStmt>
<name>Matt Miller</name>
</respStmt>
<item>Transcribed; encoded</item>
</change>
</revisionDesc>
Examples:
loc.00158 (a manuscript at the Library of Congress)
uva.00001 (a manuscript at University of Virginia)
Printed texts are all assigned the 3-letter prefix "ppp."
We use a database to track the unique identifiers and our workflow as we transcribe, encode, and upload manuscripts. This database can be accessed here.
<TEI.2 id="uva.00001">
<publicationStmt>
<idno>uva.00001</idno>
uva.00023.xml
loc.00158.002 (Page 2 of a manuscript)
These page image IDs are inserted as the value of the corresp attribute of the appropriate page break elements (<pb/>), and an entity declaration for each one must be inserted between the square brackets in the document type declaration. Example:
. . .
<!DOCTYPE TEI.2 PUBLIC "-//UVA::IATH//DTD whitman.dtd (Whitman Archive)//EN" "whitman.dtd" [
<!ENTITY uva.00023.001 SYSTEM "uva.00023.001.jpg" NDATA jpeg>
<!ENTITY uva.00023.002 SYSTEM "uva.00023.002.jpg" NDATA jpeg>
]>
. . .
<pb corresp="uva.00023.001" />
. . .
<pb corresp="uva.00023.002" />
Within the <text> of each encoded document is a structured description of the content of the item being encoded. This page describes the basic elements of this structural tagging.
The following elements are used to describe the structure of Whitman's poetic works:
A sample structure might look like this:
<!-- markup is simplified -->
<div1 type="poem notes">
<lg1 type="poem">
<head type="main-authorial" rend="underline"></head>
<l></l>
<l>
<seg></seg>
<seg></seg>
</l>
</lg1>
<lg1 type="poem">
<head type="main-derived"></head>
<lg2 type="linegroup">
<l>
<seg></seg>
<seg></seg>
<seg></seg>
</l>
<l></l>
</lg2>
<lg2 type="linegroup">
<l></l>
<l></l>
</lg2>
</lg1>
<p></p>
</div1>
<!-- markup is simplified -->
<text type="manuscript">
<body>
<lg1 type="poem">
<l>There is no word . . .
<!-- markup is simplified -->
<text type="manuscript">
<body>
<lg1 type="poem">
<lg2 type="linegroup">
<l>
<seg>[line segment here]</seg>
<seg>[line segment here]</seg>
</l>
. . .
</lg2>
<lg2 type="linegroup">[lines and segments here] </lg2>
</lg1>
</body>
</text>
<!-- markup is simplified -->
<text type="manuscript">
<body>
<div1 type="multiple poems">
<lg1 type="poem">[poem here]</lg1>
<lg1 type="poem">[poem here]</lg1>
</div1>
Prose should be divided into <p>s. No <div> is required in a prose-only document unless the prose is divided into separate intellectual units. For example, a manuscript requires <div1 type="section"> if it begins with two paragraphs about democracy, then has a clear break (e.g., a sub-heading, a horizontal line, or white space) followed by three paragraphs about the sound of the fishmonger yelling on the street. In such a case, the discrete groups of paragraphs should be marked with <div1>s. Except on title pages, line breaks <lb/> are not encoded. Also note that <lg>s are only used to mark up poetry, never prose.
<!-- markup is simplified -->
<text type="manuscript">
<body>
<div1 type="section">
<p></p>
<p></p>
</div1>
<div1 type="section">
. . .
</div1>
</body>
</text>
Many manuscripts contain single intellectual units which are a mixture of poetry and prose. (For an example, see the manuscript "Ashes of Roses," here.) "Mixed genre," for our purposes, does NOT just mean a manuscript leaf with poetry and prose on it (for example, a poetic draft on the recto and prose on the verso). Rather, "mixed genre" signifies writing that is thematically unified, apparently part of a single draft, but made up of a mix of prose and verse, as when Whitman composes an early draft that combines trial poetic lines with prose notes or lists. For a mixed-genre manuscript, use a <div1> with "poem notes" as the value of the "type" attribute, like this:
<!-- markup is simplified -->
<text type="manuscript">
<body>
<div1 type="poem notes">
etc.
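A fuller sketch of a mixed-genre structure, assembled only from the elements shown in the sample structure earlier in this section. The bracketed placeholders stand in for transcribed text and are not real readings:

```xml
<!-- markup is simplified -->
<text type="manuscript">
  <body>
    <div1 type="poem notes">
      <!-- trial poetic lines -->
      <lg1 type="poem">
        <l>[trial line here]</l>
        <l>[trial line here]</l>
      </lg1>
      <!-- prose notes belonging to the same draft -->
      <p>[prose note here]</p>
    </div1>
  </body>
</text>
```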
Some manuscripts have only titles, with no content to follow those titles, or are pages with several trial titles that Whitman never used (for an example, click here). For these unusual manuscripts, we have a different <div1> type, "title notes."
To read about the unique markup used in Title Page manuscripts, go here.
<!-- markup is simplified -->
<text type="manuscript">
<body>
<div1 type="title notes">
etc.
Each poetry manuscript transcription will have three different kinds of titles. These titles may be identical; they may be different.
This title, which occurs inside the TEI header, names the electronic file you are creating and should therefore be distinguished from the title of the source material. Do this by adding the phrase "a machine readable transcription" as a subtitle, as in the following example.
<titleStmt>
<title level="m" type="main">Death dogs my steps</title>
<title level="m" type="sub">a machine readable transcription</title>
</titleStmt>
If you are transcribing and encoding a manuscript that does not have a title written on it, derive a main title from the first line, as described below. Also, include the attribute rend with the value "bracketed," to signify that the title is one we have assigned based on the first line and should therefore be bracketed when displayed. An example:
<titleStmt>
<title level="m" type="main" rend="bracketed">And to me each minute of the night and day is vital and visible</title>
<title level="m" type="sub">a machine readable transcription</title>
</titleStmt>
For manuscripts that contain more than one poem, follow the above procedure for each poem, but for the value of level use "a" (which indicates individual items within a larger item). Then wrap all of these individual titles in another <title level="m">. The following example imagines a manuscript with two poems, the first of which Whitman has given a title and the second of which he hasn't.
<titleStmt>
<title level="m">
<title level="a" type="main">Title Written on Manuscript</title>
and
<title level="a" type="main" rend="bracketed">Title derived from first line</title>
</title>
<title level="m" type="sub">a machine readable transcription</title>
</titleStmt>
This title is the one given to the artifact by the holding institution. The <sourceDesc> is essentially a bibliography of information that should be sufficient for a user to locate the item that is the source of the transcription. If the title is bracketed in the online repository guide, you should bracket it in the <sourceDesc>. In some cases, the "title" in the <sourceDesc> may bear little relation to the poem—for example, it might be the title of the folder which holds the item rather than the title of the item itself (this is typically the case only for the Feinberg collection at the Library of Congress).
This is the location to which the stylesheet will go to pull titles of poems, etc. for indexing and display. For more detailed rules on how to formulate these titles, please see the section below. The rules for the use of <head>:
— main-authorial (written by Whitman on the page)
— main-derived (assigned by us, derived via the formula for titles; see next section below)
— sub (subtitle written by Whitman on the page; "sub" is only used for secondary authorial titles and must be preceded by <head type="main-authorial">)
- In the <titleStmt> title, use the final reading of the title, disregarding deleted passages and including added ones. For instance, for this example, the <titleStmt> title would read:
<titleStmt>
<title level="m" type="main">Ah, not this granite dead and cold.</title>
<title level="m" type="sub">a machine readable transcription</title>
</titleStmt>
- In the <head>, encode all the additions, deletions, and substitutions as such. For the same example, the <head> would be encoded like this:
<head type="main-authorial" rend="underline">
<app>
<rdg varSeq="1">
<del type="overstrike">Beyond this </del>
</rdg>
<rdg varSeq="2">
<add type="unmarked" place="supralinear">Ah, not this </add>
</rdg>
</app>
granite dead and cold.
</head>
- If a manuscript is not titled by Whitman, in the <titleStmt> use the first words that are not struck through, and go up to (but do not include) the first punctuation mark OR the end of the line OR the end of the segment, WHICHEVER COMES FIRST. For <head> follow the same procedure and assign the attribute type="main-derived."
- For poems with recurrent titles (like "Leaf"), use the title AND, in brackets, the title derived from the first line. So: Leaf [A promise to Indiana]
- Don't worry if two poems have the same title. Our unique identifier for the document will enable us to locate the correct document for processing through stylesheets.
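The rule for untitled manuscripts can be sketched as follows. The line text is an invented placeholder, not a Whitman reading. The sketch assumes the first punctuation mark (the comma) comes before the end of the line or segment, so the derived title stops just before it:

```xml
<!-- markup is simplified; the line text is an invented placeholder -->

<!-- in the header -->
<titleStmt>
  <title level="m" type="main" rend="bracketed">First words of the line</title>
  <title level="m" type="sub">a machine readable transcription</title>
</titleStmt>

<!-- in the text -->
<lg1 type="poem">
  <head type="main-derived">First words of the line</head>
  <l>First words of the line, and then the remainder</l>
</lg1>
```

The same derived wording appears in both places. Only the <titleStmt> title carries rend="bracketed"; the <head> is marked by its type value instead.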
2.5 References to External Files:
Note: The procedure described here applies to both printed and manuscript materials, even if the manuscript is only one page long.
Entity Declarations and Page Breaks
There are two steps to linking a document to its corresponding image:
- First, declare the name and location of each image in an "entity declaration" right before the <TEI.2> tag, between the square brackets of the document type declaration. In the sample below, the identifier following "!ENTITY" is the string that will be inserted within a <pb> tag; the string following "SYSTEM" is the name of the image file that corresponds with this identifier. Together, the two strings tell the computer how to find the images.
Sample:
<!ENTITY loc.00001.001 SYSTEM "loc.00001.001.jpg" NDATA jpeg>
<!ENTITY loc.00001.002 SYSTEM "loc.00001.002.jpg" NDATA jpeg>
<!ENTITY loc.00001.003 SYSTEM "loc.00001.003.jpg" NDATA jpeg>
etc.
- PLEASE NOTE:
- Entity names (the first string) should NOT include the file extension.
- Be sure the file name which follows SYSTEM includes the dot before the file extension.
- Be consistent when giving file extensions—we've chosen to always use ".jpg" rather than ".jpeg" or ".jpe," for example.
- The last portion of the declaration, "NDATA jpeg," should always be written exactly that way. This does not contradict the rule just above.
- Second, include a pb element BEFORE the content of each page, and in the corresp attribute give the name of the entity you've declared at the top of the document:
<text><body>
<pb corresp="loc.00001.001" id="leaf01r" type="recto"/>
<lg1><l>etc.
PLEASE NOTE:
- Do not include the file extension here.
- The "type" attribute is required on the <pb> element. Choose between "recto" and "verso."
- IDs must be unique within each document, so for manuscripts that have writing on both the recto and verso and/or that have more than one page, modify the id attribute for subsequent <pb>s. For the recto of leaf one, use id="leaf01r"; for the verso of leaf one, use id="leaf01v," etc.
- Include the id attribute even on one-page documents.
- For printed works or manuscripts that have page numbers, also include an "n" attribute in the <pb>. So: <pb corresp="ppp.00001.001" id="leaf01r" type="recto" n="1"/>
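Putting the two steps together, a one-leaf manuscript with writing on both sides might look roughly like this. The identifier loc.00001 is only a placeholder, the header is omitted, and the sketch assumes a single poem that continues from the recto onto the verso:

```xml
<!-- markup is simplified; loc.00001 is a placeholder identifier -->
<!DOCTYPE TEI.2 PUBLIC "-//UVA::IATH//DTD whitman.dtd (Whitman Archive)//EN" "whitman.dtd" [
<!ENTITY loc.00001.001 SYSTEM "loc.00001.001.jpg" NDATA jpeg>
<!ENTITY loc.00001.002 SYSTEM "loc.00001.002.jpg" NDATA jpeg>
]>
<TEI.2 id="loc.00001">
  <!-- teiHeader omitted -->
  <text type="manuscript">
    <body>
      <!-- recto: corresp names the first entity, without a file extension -->
      <pb corresp="loc.00001.001" id="leaf01r" type="recto"/>
      <lg1 type="poem">
        <l>[line on the recto]</l>
        <!-- verso: a new, unique id and the second entity -->
        <pb corresp="loc.00001.002" id="leaf01v" type="verso"/>
        <l>[line on the verso]</l>
      </lg1>
    </body>
  </text>
</TEI.2>
```

Each corresp value matches an entity name declared in the document type declaration, and each <pb> sits immediately before the content of its page.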