Translation Seminar

Revision as of 19:12, 13 May 2011 by Elorang


This wiki is meant to provide information and resources on preparing translations of Whitman's work for online publication to participants of the 2011 Obermann Seminar on Whitman Translation at the University of Iowa.

All participants are invited to contribute to and edit this wiki.

Editorial Policies for Whitman translations

(Please note that the following editorial policies are taken from the Whitman Archive's general editorial policy section and may be modified significantly as a result of decisions made at this symposium. The following statement should therefore be treated as provisional.)

For each language, we designate a lead editor, who establishes the goals and scope of the undertaking in consultation with the general editors. For this section of the Archive, we provide page images when possible along with transcriptions. In the transcriptions we do not attempt to capture the so-called bibliographic codes—the appearance of margins, fonts, and ornaments in the original printed documents. Most other features of the printed page are preserved: capitalization, hyphenation, punctuation, and page breaks. The transcription and encoding processes are followed by silent proofreading of the transcription against the original source document. Our electronic transcriptions preserve typographical errors present in the original; to aid in searching, corrected forms are also included in the encoding.

Suggestions for Additional Reading

  • Jerome McGann and Dino Buzzetti, "Electronic Textual Editing: Critical Editing in a Digital Horizon," from the MLA volume, Electronic Textual Editing, available here
  • Allen H. Renear, "Text Encoding," from the Blackwell Companion to Digital Humanities, available here
  • Amanda Gailey, "A Case for Heavy Editing," from The American Literary Scholar in the Digital Age, available here.
  • James Cummings, "The Text Encoding Initiative and the Study of Literature," from the Blackwell Companion to Digital Literary Studies, available here
  • Mary-Jo Kline and Susan Holbrook Perdue, eds., A Guide to Documentary Editing, Third Edition (Charlottesville: University of Virginia Press, 2008)

Steps to Publication

Regular contact is essential to the success of this collaboration. Please be in touch whenever you begin a new stage of the process or when you have questions.

Identify existing translations

Compile a list of all existing translations in the desired target language. This list is useful for making decisions about which translation(s) to encode first and for planning the structure of the respective translation section on the Archive website.

Choose a translation

Decide which translations to prioritize for transcription, encoding, and online publication. It would be useful if you could post a note describing your choice and plans for editing the translation to the Whitman Archive listserv.

Try to find out if there is an e-text of the translation available, since this would save a lot of time and effort.

Scan documents

In general, scans of documents are preferable to photographs unless the documents are so fragile that they can only be photographed. Be sure that the scans are 24-bit uncompressed color TIFFs with a resolution of 600 dpi, since that is the standard quality we use for all documents at the Archive.

Process images

Please follow the general Archive image processing guidelines.

Transcribe and encode

If there is no e-text version of the translation you want to encode, you will need to transcribe the text from page images by typing it into a text editing program. As a matter of standardization and overall ease, WWA staff members use the application Oxygen when encoding a document, which is why the transcription of a document should also be done with this program. If you are working with an e-text version that you have copied and pasted into the text editor, you will still need to check each letter and punctuation mark carefully, since errors are often introduced into these electronic texts, especially when they were generated automatically.

Encoding is the process of transforming information from one format into another. Specifically, text encoding uses a markup language to tag the structure and other features of a text to facilitate processing by computers.

With regard to the WWA, encoding refers to the transformation of plain-text documents (i.e. transcriptions of Whitman's manuscripts, printed matter, etc.) into XML files, according to P5 standards. For more detailed information on the WWA's encoding specifications and practices, please view the Encoding Guidelines found in the Technological Details section of this wiki and download the template and sample XML files. As with any digital project, encoders should save and back up their work regularly.

Ideally, transcribing and encoding should be done at the same time.

Validate and upload your XML file

When you have finished a first pass of the transcription and encoding, you need to validate the document against the Whitman Archive's TEI schema. The schema declares all of the tags (elements) that can be used in a Whitman Archive TEI file, the order and hierarchy in which they can appear, and the kinds of content they can contain. When you validate a document, Oxygen both makes sure that your file is well-formed (that elements are properly nested and that every element you have opened is also closed) and checks your encoding against the Whitman Archive schema to make sure you haven't used any illegal tags or used specific elements in places where they are not allowed. To validate the document, click the red check mark icon. Oxygen will immediately check your document. Depending on the outcome of validation, at the bottom of the document you will see either a green square and the message "Document is Valid" or a red square with a message that the document is invalid, along with a list of validation errors. After you resolve each validation error, revalidate the document. Sometimes one error causes several others, so fixing the first error will resolve the others automatically. If you have questions about validation errors or error messages, contact Liz.

Once the file validates, upload it to the Whitman server. Put all translation XML/TEI files here: /data/public/whitmanarchive/published/foreign/tei (For instructions on uploading files to the server, see "Downloading, Installing, and Using the WinSCP File Transfer Client.") After you've uploaded the file, make sure the permissions on the file allow other individuals and groups to modify it.

Under the "Rights" column in the server-side pane of the file transfer program, you'll see a series of Rs, Ws, Xs, and dashes for each file. The letters stand for "read," "write," and "execute"; a letter means that the permission is enabled, and a dash means that it is not. Each file carries three sets of permissions, one for each type of user: the owner of the file, a group of users set up by the server administrator (in this case, a "Whitman" group), and all others. For this project, both "read" and "write" permissions must be enabled for the owner, the group, and others. If the rights column for a file reads "rw-rw-rw-" (or the same string with an x in place of any of the dashes), there is nothing else you need to do.

Otherwise, you will need to modify the permissions on the file. To do so, right-click the filename and choose "Properties." The bottom section of the properties box is for assigning permissions. Click the "w" box if it is unchecked for either group, other, or both. (By default, the owner of the file should have write permission.) Then click "OK." If you need to change the permissions for more than one file at a time, select multiple filenames using either Shift or Ctrl, in combination with clicking on the files. Then right-click and choose "Properties," and select the appropriate write permissions.

Please upload files as you complete them or at the end of a shift. Do not wait until you are days or weeks into the project to begin uploading the files. Liz will begin checking and editing the files as they are uploaded.

Write an introduction

Write an introduction to the particular language in the translation section, or, if such an introduction already exists, add an introductory paragraph for the new translation you have encoded and uploaded.

Technological Details

Choosing a Text Editor

The Whitman Archive's text editing and encoding program of choice is Oxygen. We recommend that you use it as well, since it is the program we know best and we can assist you if problems arise. An academic license costs about $60.

Annotated XML Translation Template and Sample Files

The TEI P5 template is available here.

Here is a sample file showing how to encode a series of French poems published in a periodical in P5.

Before beginning any transcription or encoding work, review the template and sample files, paying particular attention to text within <!-- -->. This text will appear in green if you are using Oxygen. The template includes content that will be the same in all files as well as instructions on providing information specific to individual texts. The sample files are fully transcribed and annotated poems in translation that have been completed based on the template.

Encoding Guidelines

File Header

Every XML document we create has a "header," which carries essential information about who is responsible for creating and publishing the document and about the source of the text we are marking up, and which serves as a kind of electronic title page. The header is analogous to a book's first few pages, which inform you of the author, publisher, copyright date, terms of publication, etc.

Since much of the information in the header is the same for all of the XML documents we create, we recommend that you use the annotated P5 template (to download the XML file, right-click on the link) to simplify your encoding of it.

Below, you will find descriptions of the main parts of the header.

The <teiHeader> has two principal components:

  • <fileDesc> contains a full bibliographic description of an electronic file
  • <revisionDesc> summarizes the revision history for a file

These elements are arranged within the <teiHeader> in this order, so the overall structure of <teiHeader> is this:
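
A minimal sketch of that ordering, with the child elements left as placeholders. The comments are illustrative and are not part of the Archive's template.

```xml
<teiHeader>
  <fileDesc>
    <!-- titleStmt, editionStmt, publicationStmt, notesStmt, sourceDesc -->
  </fileDesc>
  <revisionDesc>
    <!-- one or more change elements -->
  </revisionDesc>
</teiHeader>
```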


<fileDesc> File description

<fileDesc> should contain the following components:

  • <titleStmt>
  • <editionStmt>
  • <publicationStmt>
  • <notesStmt>
  • <sourceDesc>

<titleStmt> Title statement

The title statement includes
1) the title given to the electronic work (which here always includes the subtitle provided by us: "a machine readable transcription")
2) the author
3) the editors
4) information about others responsible for aspects of the electronic text
5) the name of the sponsors and funders.

Here is an example of a full title statement:

<titleStmt>
<title level="m" type="main">Song of Myself</title>
<title level="m" type="sub">a machine readable transcription</title>
<author>Walt Whitman</author>
<editor>Kenneth M. Price</editor>
<editor>Ed Folsom</editor>
<respStmt>
<resp>Transcription and encoding</resp>
<name>The Walt Whitman Archive Staff</name>
</respStmt>
<sponsor>Center for Digital Research in the Humanities, University of Nebraska-Lincoln</sponsor>
<sponsor>University of Iowa</sponsor>
<funder>The National Endowment for the Humanities</funder>
</titleStmt>

<editionStmt> Edition statement

The edition statement gives the current date. Example:
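
A minimal sketch of an edition statement, assuming the date is recorded in a <date> element inside <edition> (the usual TEI P5 pattern). The year is a placeholder; check the annotated template for the exact form the Archive expects.

```xml
<editionStmt>
  <!-- Assumed structure; the year is a placeholder -->
  <edition>
    <date>2011</date>
  </edition>
</editionStmt>
```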


<publicationStmt> Publication statement

The publication statement includes the unique id number <idno>, distributor <distributor>, address <address>, and a statement of rights and availability <availability>.

<publicationStmt>
<idno>med.00400</idno>
<distributor>The Walt Whitman Archive</distributor>
<address>
<addrLine>Center for Digital Research in the Humanities</addrLine>
<addrLine>319 Love Library</addrLine>
<addrLine>University of Nebraska-Lincoln</addrLine>
<addrLine>P.O. Box 884100</addrLine>
<addrLine>Lincoln, NE 68588-4100</addrLine>
</address>
<availability><p>Copyright © 2010 by Ed Folsom and Kenneth M. Price, all rights reserved. Items in the Archive may be shared in accordance with the Fair Use provisions of U.S. copyright law. Redistribution or republication on other terms, in any medium, requires express written consent from the editors and advance notification of the publisher, Center for Digital Research in the Humanities. Permission to reproduce the graphic images in this archive has been granted by the owners of the originals for this publication only.</p></availability>
</publicationStmt>

<notesStmt> Notes statement

<notesStmt>
<note type="project" target="#dat1"></note>
<note type="project">The following are responsible for particular readings or for changes to this file, as noted:
<persName xml:id="bb">Brett Barney</persName>
</note>
</notesStmt>

Note on <persName>: The value of @xml:id should be your initials, in lower case.

<sourceDesc> Source description

The source description provides a bibliographic description of the text copy used in the creation of the present electronic text. Example:

<sourceDesc>
<bibl>
<author>Walt Whitman</author>
<idno type="callno"></idno>
<orgName>The Walt Whitman Collection, Harry Ransom Humanities Research Center, The University of Texas at Austin</orgName>
<note type="project">Transcribed from our own digital image of original manuscript.</note>
</bibl>
</sourceDesc>

Note on <idno>: This provides the title by which the object is identified at the repository. You should be able to find it in the individual finding aid for the repository in question. Click here for a list of available online finding aids.

Note on <orgName>: The institution that holds the manuscript should be cited as listed in the Preferred Citation table in the References section of the Encoding Guidelines (see below).

Notes on description of source: This information is about the copy text, and the <title> here (as opposed to the one in titleStmt) should be given exactly as it appears in the records of the institutional repository, no matter how imprecise or wrong-headed their conventions may seem. Many times, the most specific title for the material will be that given to the folder used to store it, since few archives assign a title to each individual item; often, therefore, the <title> given in the <sourceDesc> will be a folder label.

<revisionDesc> Revision description

The <revisionDesc> element is used to summarize the changes that have been made to the file. If multiple changes are performed at different times, add another <change> at the top, so that changes are listed in reverse chronological order (most recent change first). To describe the tasks in our routine workflow, choose from the following terms for the content of <change>:

   * Transcribed; encoded
   * Checked; revised
   * Edited
   * Blessed 


<change when="2010-08-12" who="#bb">transcribed; encoded</change>
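
A sketch of a <revisionDesc> after a second stage of work, with the most recent change listed first. The later date and the initials "xy" are hypothetical; the lowercase wording follows the example above.

```xml
<revisionDesc>
  <!-- Most recent change first; "xy" and the later date are hypothetical -->
  <change when="2010-09-02" who="#xy">checked; revised</change>
  <change when="2010-08-12" who="#bb">transcribed; encoded</change>
</revisionDesc>
```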

Unique Identifiers

Unique identifiers are one-of-a-kind names assigned to each electronic text we create. That is, every poem, collection of poems and work (for an explanation of "work" vs. "document" click here) must have a unique ID.

For translations, IDs are made up of the 3-character code "med" (which stands for "mediated") plus a 5-digit number (assigned in ascending order), with the two fields separated by a dot.

Example: med.00400

We use a database to track the unique identifiers and our workflow as we transcribe, encode, and upload manuscripts. This database can be accessed here.

Placement of IDs

The unique identifier appears in two places in the TEI header:

  • As an attribute value in the TEI root element (the very first tag):

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="med.00400">

  • As content in the <publicationStmt>:
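
A minimal sketch of this second placement, using the example ID med.00400. The other children of <publicationStmt> are elided, and the position of <idno> relative to them is an assumption; follow the annotated template.

```xml
<publicationStmt>
  <idno>med.00400</idno>
  <!-- distributor, address, and availability follow -->
</publicationStmt>
```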


Transcription File Names

To name the file when you save it, simply add the file extension ".xml" to the ID. Example: "med.00400.xml"

Basic Document Structure by Genre


  • For a document featuring one poem composed of a single group of lines, do not use a <div>. Instead, use a structure like the following:

<text type="manuscript">
<lg type="poem">
<head type="main-authorial"></head>
<l>There is no word . . .</l>
</lg>
</text>

  • For a single poem clearly divided into smaller chunks:

<text type="manuscript">
<lg type="poem">
<head type="main-authorial">One's-Self I Sing.</head>
<lg type="linegroup">
<l>ONE'S-SELF I sing, a simple separate person,</l>
<l>Yet utter the word Democratic, the word En-Masse.</l>
</lg>
<lg type="linegroup">
<l>Of physiology from top to toe I sing,</l>
<l>Not physiognomy alone nor brain alone is worthy for the Muse, I<lb/>
say the Form complete is worthier far,</l>
<l>The Female equally with the Male I sing.</l>
</lg>
<lg type="linegroup">
<l>Of Life immense in passion, pulse, and power,</l>
<l>Cheerful, for freest action form'd under the laws divine,</l>
<l>The Modern Man I sing.</l>
</lg>
</lg>
</text>

Note: Use type="linegroup" in <lg> tags to note multiple lines clearly grouped together (e.g. a stanza) and followed by space left intentionally blank.

  • For a manuscript containing two or more poems:

<text type="manuscript">
<div1 type="multiple poems">
<lg type="poem">[poem here; follow structure outlined above]</lg>
<lg type="poem">[poem here]</lg>
</div1>
</text>


Prose should be divided into paragraphs using the <p> tag. No division tag is required in a prose-only document unless the prose is divided into separate intellectual units. For example, a manuscript requires <div1 type="section"> if it begins with one or more texts constituting an intellectual unit (e.g. an essay or a group of letters), then has a clear break (e.g., a sub-heading, a horizontal line, or white space), and is then followed by another group of texts that is distinct in form or content (e.g. one essay following another). In such a case, the discrete groups of paragraphs should be marked with <div1>s, or, if they are already nested within a larger <div1> structure, with <div2>s and so on. Except on title pages, line breaks <lb/> are not encoded. Also note that <lg>s are only used to mark up poetry, never prose. Example:

<text type="manuscript">
<div1 type="section">
. . .
</div1>
<div1 type="section">
. . .
</div1>
</text>

Mixed Genre

Many manuscripts contain single intellectual units which are a mixture of poetry and prose. (For an example, see the manuscript "Ashes of Roses," here) "Mixed genre," for our purposes, does NOT just mean a manuscript leaf with poetry and prose on it (for example, a poetic draft on the recto and prose on the verso). Rather, "mixed genre" signifies writing that is thematically unified, apparently part of a single draft, but made up of a mix of prose and verse, as when Whitman composes an early draft that combines trial poetic lines with prose notes or lists. For a mixed-genre manuscript, use a <div1> with "poem notes" as the value of the "type" attribute, like this:

<text type="manuscript">
<div1 type="poem notes">
. . .
</div1>
</text>

Basic Elements for Marking Structure

The following elements are used to describe the structure of Whitman's poetic works:

<div1> Division

Used, with the type attribute, to mark structural units larger than the cluster or poem. Values for the type attribute include "book," "section," "contents," "poem notes," "title notes," and "multiple poems." The largest unit is marked as <div1>, and descending levels of <div> can be nested inside. Click here to read an explanation of the different type attributes that are used with <div>s when marking up Whitman documents.

<lg> Line Group

Functions in the same way as <div>, but is used exclusively to mark clusters, poems, and structural sub-units within them (i.e., groups of lines, called "sections" or "linegroups," that constitute distinct units within a poem). If the poem has no distinguishable sub-units within it, no further <lg>s are needed; if the poem has one or more sub-units, you need to mark each of those units with the appropriate <lg>. As with <div>s above, descending levels of <lg> are nested inside the outermost <lg>. For example, for a manuscript of a poem broken into three linegroups, the poem itself would be tagged <lg type="poem"> and each linegroup would be tagged <lg type="linegroup">. The type attribute is required; values include "cluster," "poem," "section," and "linegroup." For an example of how you would encode a poem divided into three stanzas or linegroups, click here.

<head> Head

Marks the title. Used on all <div>s and <lg type="poem">s, even when the source shows no title. For a more thorough discussion click here.

Each <lg> and division tag can have its own <head> (and thus its own title). Head tags are required for <lg type="poem"> and for all division tags within <text type="manuscript">. Head tags are not required—nor typically necessary—on any <lg> other than <lg type="poem"> or on division tags within notebooks (<text type="notebook">).

The "type" attribute on this element is required to differentiate titles physically present on the manuscript from those assigned by our project and to distinguish between main titles and subtitles. Use one of these three values:

  • main-authorial (written by Whitman on the page)
  • main-derived (assigned by us, derived via the formula for titles; see next section below)
  • sub (subtitle written by Whitman on the page; "sub" is only used for secondary authorial titles and must be preceded by <head type="main-authorial">)

<l> Line

Used to mark a poetic line. Use <lb/> to mark a line break.

A sample structure might look like this:

<lg type="poem">
<l>I celebrate myself,</l>
<l>And what I assume you shall assume,</l>
<l>For every atom belonging to me, as good belongs<lb/>
to you.</l>
</lg>

Indented lines

Use the "rend" attribute on the <l> tag to indicate the amount of indentation.

Possible values correspond to width of indentation, with "indented1" being the smallest indentation and "indented4" being the largest: "indented1," "indented2," "indented3," "indented4"


<l rend="indented2">Still may I hear his word.</l>

<p> Paragraph

Used to mark a paragraph within a prose text.

A sample structure might look like this:

<div1 type="section">
<p> . . .</p>
<p> . . .</p>
</div1>

Other Common Elements


For poetry quotations:

<lg type="poem">
<l> . . .</l>
</lg>

For prose quotations:

<p> . . .</p>

<hi> Highlighted text (italics, smallcaps, underlining, etc.)

<hi> (highlighted) marks a word or phrase as graphically distinct from the surrounding text. Typically, we use it to indicate that individual words, phrases, or sentences within larger structures such as lines and paragraphs are highlighted in the original through italicization, the use of small caps, underlining, etc. The <hi> element uses the "rend" attribute to specify the nature of the highlighting:

Value of 'rend' attribute    Function
underline                    Indicates underscored text
italic                       Used only in transcriptions of printed material or in project notes to mark titles of books

<orig>, <reg>, <sic>, and <corr> Regularized Spelling and Corrections

<orig> (original form) contains a reading which is marked as following the original, rather than being normalized or corrected.

<reg> (regularization) contains a reading which has been normalized in some sense.

We often use these tags when encoding Whitman's poetry, for instance in cases where a word at the end of a poetic line is hyphenated. Because we wish both to record the lineation of the copy text and to enable searches for words that are broken by end-line hyphenation, we use the <orig> and <reg> tags to record the original and regularized readings. This will allow the original version to be displayed online while users can still search for the regularized form and be directed to the passage in question.

<sic> (Latin for "thus" or "so") contains text reproduced although apparently incorrect or inaccurate and is used to represent a mistake by the author.

<corr> allows the encoder to provide a correction.

These corrections will enable searches to use standardized spelling and not require the searcher to know, for example, that Whitman misspelled "Buildings" as "Buldings" in this manuscript.

All of these elements are nested within the <choice> element, which groups a number of alternative encodings for the same point in a text.

Example for <orig> and <reg>:
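
A minimal sketch for a word hyphenated across a line end. The line break inside "belongs" is hypothetical, and placing <lb/> inside <orig> is an assumption; check the sample files for the Archive's convention.

```xml
<!-- Hypothetical end-line hyphenation: "be-" / "longs" -->
<l>For every atom belonging to me, as good <choice><orig>be-<lb/>longs</orig><reg>belongs</reg></choice> to you.</l>
```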


Example for <sic> and <corr>:

<choice>
<sic>the incorrect way it's written</sic>
<corr>the correct way to write it</corr>
</choice>

Note: Sometimes what you might think of as a spelling error would more accurately be termed an alternate spelling. For words that are spelled in idiosyncratic—though not exactly incorrect—ways, use the <orig> and <reg> tags as described above. As an example, look at Whitman's spelling of "Shakespeare" in this image. Since this spelling of Shakespeare's name is one he himself used (and he never, as far as we know, used "Shakespeare"), it should be encoded as follows:
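
A minimal sketch of that encoding. The original spelling shown here, "Shakspere," is an assumption (it is a form Whitman commonly used); transcribe whatever spelling actually appears in the manuscript image.

```xml
<choice>
  <!-- Assumed original spelling; confirm against the image -->
  <orig>Shakspere</orig>
  <reg>Shakespeare</reg>
</choice>
```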


<table> Table

This element is generally, though not exclusively, used to encode a table of contents. Here is an example for how to encode a table of contents:

<div1 type="contents">
<head type="main-authorial">Table of Contents</head>
<table>
<row><cell>Poem 1</cell></row>
<row><cell>Poem 2</cell></row>
</table>
</div1>

Note: If a table spans multiple pages, you will need to close the table before inserting the <pb/> element and then open a new table for formatting purposes. Example:


<pb facs="med.00400.577.jpg" xml:id="leaf289r" n="575" type="recto"/>
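
A sketch of how a table is closed and reopened around that page break. Wrapping each <cell> in a <row> follows standard TEI practice and is an assumption about the Archive's schema; the cell contents are placeholders.

```xml
<div1 type="contents">
  <table>
    <row><cell>Poem 1</cell></row>
  </table>
  <!-- The page break sits between the two tables, not inside either one -->
  <pb facs="med.00400.577.jpg" xml:id="leaf289r" n="575" type="recto"/>
  <table>
    <row><cell>Poem 2</cell></row>
  </table>
</div1>
```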

<list> List

<list><item>[text goes here]</item></list>

<gap> and <unclear> Text that is illegible, missing, or difficult to read

<gap>: This element is used when text is absolutely unreadable: when, for example, it has been torn or cut away, is obscured by deletion, or is simply illegibly written. Each <gap> needs a reason attribute, and you have the choice of two values: "cut away" or "illegible." Note: <gap> is an empty element (i.e., it does not require a close tag).

  • <gap reason="cut away"/>: When a page has been torn or cut, leaving only stubs of the letters you want to transcribe, use this tag at the point in the transcription where the words would appear.
  • <gap reason="illegible"/>: Use this markup whenever the letters or words are present but unreadable.
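
A minimal sketch showing both values of <gap> in context. The lines of verse are made-up placeholders, not text from any Whitman document.

```xml
<lg type="linegroup">
  <!-- Letters are present on the page but cannot be read -->
  <l>The sea whispered <gap reason="illegible"/> to the shore,</l>
  <!-- The end of the line has been torn or cut from the leaf -->
  <l>And the shore answered <gap reason="cut away"/></l>
</lg>
```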

<unclear>: When you believe you have an accurate reading of a difficult-to-read passage, but you are not completely confident, mark the questionable reading with the <unclear> element. Use the reason attribute to state the cause of the uncertainty in transcription, selecting from the values described above under <gap>. Use the cert (certainty) attribute to indicate the degree of confidence in the transcription. Its value will be one of the following:

  • low
  • medium
  • high
  • absolute

Also include a resp (responsibility) attribute to indicate your responsibility for the postulated reading, and as its value use your initials.

For example, if Andy Jewell is encoding a manuscript with an unclear deleted word that he thinks might be "herbage," he inserts this markup:

<unclear reason="deletion, illegible" cert="high" resp="awj">herbage</unclear>

<note> Footnotes and annotations

The basic markup for encoding a footnote in a translation is:

<note type="editorial" resp="[name of translator]" xml:id="n1">[text of the note]</note>

Each note will be encoded within a <note> tag, and the tag will take the attributes type, resp, and xml:id.

If the author of the note is the translator or editor, the note type will be "authorial" if it is a footnote to a preface, introduction, or other materials directly authored by the translator, or "editorial" if it is a footnote to a translated text by Whitman. If the footnote was written by Whitman in the English original, the type will be "authorial."

The value for @resp will depend on the source of the note:

  • If it is one that we are using unchanged from the translator, the value of @resp will be the translator's last name, e.g. "reisiger" or "meira" (as in the example below).
  • If the note is one that we've written here at the wwa, the value of @resp is "wwa."
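
A sketch of the second case, a note written at the Archive. The xml:id value and the note text are placeholders.

```xml
<!-- resp="wwa" marks the note as written by Archive staff -->
<note type="editorial" resp="wwa" xml:id="n2">[text of the note]</note>
```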

@xml:id provides a unique identifier for the note and will take the format "n[#]." Give the first note in the document the xml:id of "n1," and number subsequent notes sequentially. Note that the order of the notes in the document does not need to match the numerical order of their xml:ids. For example, the third note in the document does not necessarily need to have an xml:id of "n3." In many cases, these numbers will correspond, but they do not have to, and you should not spend time reordering them if they get out of sync. The crucial thing is that the values of xml:id are unique. The document will not validate if they are not.


Here is an example of a footnote that the Brazilian translator Luciano Alves Meira added to the Portuguese translation of Whitman's poem "Fancies at Navesink" to explain the word "Navesink":

<lg type="poem">
<head type="main-authorial">Fantasias em Navesink</head>
<!--Fancies at Navesink -->
<!--<relations><work entity="xxx.00330"/></relations>-->
<lg type="section">
<head type="main-authorial"><hi rend="italic">O timoneiro na neblina</hi></head>
<l>Navegando a vapor pelas correntezas setentrionais — (uma antiga lembrança do St. Lawrence,</l>
<l>Um repentino lampejo de memória retorna, não sei por quê,</l>
<l>Aqui, esperando pelo nascer do sol, contemplando do alto desta montanha)*<note type="editorial" xml:id="n18" place="foot" resp="meira">*Navesink: a entrada inferior de uma encosta marítima da Baía de Nova Iorque. (N. do E.)</note></l>

<pb facs="med.00400.489.jpg" xml:id="leaf245r" n="487" type="recto"/>
. . .
</lg>
</lg>

Page Breaks and Image Linking

We use the <pb> tag to indicate page breaks. This tag is inserted at the beginning of a new page, and, if available, a link to an image of the page is provided. Use <pb> tags in every document, even one that is only one page long. <pb> is an empty tag, which means that you never need to "close" <pb>, but just insert a "/" at the end of the tag. The first <pb> tag goes after the <body> tag and before the first <div> or <lg>. If there are multiple pages, i.e., more than one corresponding image, simply insert a <pb> at each place in the encoding that corresponds to the beginning of a new page. Often, these will occur at the close of one linegroup (</lg>) and before the opening of another (<lg>). Or, commonly, you will need to include a <pb> to indicate untranscribed verso material; this should be done after the <lg> or <div> closes but before the <body> tag closes.

If a page is blank it still needs to be encoded using the <pb> tag.
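
A sketch of where the first <pb> sits relative to <body> and the first <lg>, followed by a <pb> for a blank verso. The image file names are hypothetical and simply follow the naming pattern used in the examples on this page; the poem content is elided.

```xml
<text type="manuscript">
  <body>
    <!-- First page break: after <body>, before the first <lg> -->
    <pb facs="med.00400.001.jpg" xml:id="leaf001r" type="recto"/>
    <lg type="poem">
      <head type="main-authorial"></head>
      <l>. . .</l>
    </lg>
    <!-- A blank verso still gets its own <pb>, placed before </body> -->
    <pb facs="med.00400.002.jpg" xml:id="leaf001v" type="verso"/>
  </body>
</text>
```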

Each <pb> tag has three required attributes, "facs," "xml:id," and "type". Numbered pages also have the attribute "n."


<pb facs="med.00400.096.jpg" xml:id="leaf048r" n="94" type="recto"/>

The attribute "n" provides the page number as displayed on the individual page. If the page that you are encoding is unnumbered, omit the attribute (e.g. <pb facs="med.00400.096.jpg" xml:id="leaf048r" type="recto"/>).

The value of the "facs" attribute consists of the relevant image file associated with the page. In the example, the file named "med.00400.096.jpg" provides an image of page 94.

Note on "xml:id" and "type": These two attributes record the leaf on which a given page appears and whether it is a recto or a verso. The front side of the first leaf of a document will always be "leaf001r" and the back (or verso) will be "leaf001v." In almost all cases with documents consisting of several leaves, "recto" and "verso" alternate, and each leaf has one of each. For example, the page breaks following the one in the example would feature the following values for "xml:id" and "type":

<pb facs="med.00400.097.jpg" xml:id="leaf048v" n="95" type="verso"/>
<pb facs="med.00400.098.jpg" xml:id="leaf049r" n="96" type="recto"/>
<pb facs="med.00400.099.jpg" xml:id="leaf049v" n="97" type="verso"/>
<pb facs="med.00400.100.jpg" xml:id="leaf050r" n="98" type="recto"/>
. . .

The "xml:id" value must always end in either "r" or "v"—even if there is only one image. When there is only one image, the "xml:id" value will almost always be "leaf01r."
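The placement and attribute rules above can be sketched in a short fragment. This is only an illustration, not a template taken from the Archive: the image file names are hypothetical, the <l> line elements and the placeholder text are assumptions, and the "xml:id" values follow the three-digit pattern of the examples above.

```xml
<body>
  <!-- First page break: directly after <body>, before the first <lg> -->
  <pb facs="example.001.jpg" xml:id="leaf001r" n="1" type="recto"/>
  <lg>
    <l>A line transcribed from page 1</l>
  </lg>
  <!-- A new page begins between two linegroups -->
  <pb facs="example.002.jpg" xml:id="leaf001v" n="2" type="verso"/>
  <lg>
    <l>A line transcribed from page 2</l>
  </lg>
  <!-- Blank, unnumbered page: it still gets a <pb>, with "n" omitted -->
  <pb facs="example.003.jpg" xml:id="leaf002r" type="recto"/>
</body>
```

Note that the last <pb> comes after the final linegroup closes but before </body>, as described for untranscribed or blank pages.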

How to Handle Unusual Document Order: In some instances, Whitman has written a single poem on the rectos of several leaves that also have poetic lines on the verso that are not part of the same poem. In this case, you must encode in a way that preserves the intellectual unity of the poem on the rectos. To do that, you will have to break the typical order of <pb> "xml:id" values. That is, instead of "leaf01r," then "leaf01v," "leaf02r," "leaf02v," etc., encode the pages in an order that preserves the integrity of each poem. For example, if you have a manuscript with a poem written across the rectos of three leaves and other poetic lines written on the versos of leaves 1 and 3, the <pb> tags will have "xml:id" attributes ordered like this: leaf01r, leaf02r, leaf03r, leaf01v, leaf03v. It is done this way to ensure that the material on the rectos of leaves 1-3 is all contained within the same <lg1 type="poem">.
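The ordering described above might look roughly like the following fragment. It is a sketch under stated assumptions: the image file names and placeholder lines are invented, the <l> elements are assumed, and wrapping the verso material in its own <lg1 type="poem"> elements is a guess at how those lines would be grouped.

```xml
<body>
  <pb facs="example.leaf01r.jpg" xml:id="leaf01r" type="recto"/>
  <lg1 type="poem">
    <!-- The poem runs across three rectos, so all three stay in one <lg1> -->
    <l>Poem line on the recto of leaf 1</l>
    <pb facs="example.leaf02r.jpg" xml:id="leaf02r" type="recto"/>
    <l>Poem line on the recto of leaf 2</l>
    <pb facs="example.leaf03r.jpg" xml:id="leaf03r" type="recto"/>
    <l>Poem line on the recto of leaf 3</l>
  </lg1>
  <!-- Verso material follows only after the recto poem is closed -->
  <pb facs="example.leaf01v.jpg" xml:id="leaf01v" type="verso"/>
  <lg1 type="poem">
    <l>Unrelated line on the verso of leaf 1</l>
  </lg1>
  <pb facs="example.leaf03v.jpg" xml:id="leaf03v" type="verso"/>
  <lg1 type="poem">
    <l>Unrelated line on the verso of leaf 3</l>
  </lg1>
</body>
```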

Unusual Characters & Marks

Our XML files are declared with the ASCII character set, which roughly corresponds with the set of characters on a standard keyboard. (XML itself can represent any Unicode character, but a file declared as ASCII can contain only ASCII characters directly.) Not all of the characters you might encounter in a Whitman manuscript are part of the ASCII character set, so to represent one of these unsupported characters you will need to use the appropriate Unicode number, written as a numeric character reference: a string of numerals that begins with an ampersand and pound sign (&#) and ends with a semicolon (;).

The table below lists the Unicode numbers we are using on the project. It is important to use the numbers for the listed characters, even when it might be possible to key them in (as with the ampersand, for example) or to use a close approximation (e.g., two hyphens to represent an em-dash). For characters not listed, Unicode numbers are NOT necessary.

For the characters in the left-hand column to display correctly, you must have a Unicode font installed on your computer.

Common Characters

Character | Function in Whitman | Unicode Number
= | Proofreader's mark for hyphen. WW sometimes uses "=" for compound words ("down=balls") and words split between two lines ("some=thing"). | &#8209;
— | Em (long) dash, e.g., "Not these—O none of these more" | &#8212;
* | An asterisk | &#42;
& | Indicates 'and' | &#38;
© | Copyright symbol | &#169;
✓ | Checkmark | &#10003;
½ | Used often in Bowers's system of page numbering | &#189;
¾ | Used to indicate the fraction, occasionally on manuscripts | &#190;
¶ | Indicates a new paragraph or a new line of poetry | &#182;
☞ | A right-pointing finger | &#9758;
☜ | A left-pointing finger | &#9756;
☝ | An up-pointing finger | &#9757;
☟ | A down-pointing finger | &#9759;
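As a brief illustration of how these numbers appear in a transcription, the fragment below encodes the em dash example from the table, followed by an invented line that combines the "=" hyphen mark and the ampersand. The <lg> and <l> wrappers are assumed here for the sake of the example.

```xml
<lg>
  <!-- &#8212; is the em dash; two keyed hyphens would not be acceptable -->
  <l>Not these&#8212;O none of these more</l>
  <!-- &#8209; stands for the "=" hyphen mark, &#38; for the ampersand -->
  <l>down&#8209;balls &#38; some&#8209;thing</l>
</lg>
```

Keying a bare "&" instead of &#38; would make the file ill-formed, because XML treats the ampersand as the start of a reference.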

Special characters

Many languages feature special characters that cannot simply be typed using a keyboard configured for American English. These characters need to be encoded using Unicode numbers. For instance, to encode the character "ñ" (n with tilde), which is frequently used in Spanish, you would insert the numeric character reference "&#241;". You can look up the relevant number in any Unicode character chart.
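For example, a line containing the word "España" could be keyed as follows. The line itself is invented for illustration, and the <l> element is assumed.

```xml
<!-- &#241; is the lowercase n with tilde -->
<l>En Espa&#241;a</l>
```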

Non-Latin alphabets

If you are encoding a translation in a language that does not use the Latin alphabet, for instance Russian or Hebrew, you will need to use a special character set that has to be declared at the top of the XML file.


For texts using the Latin alphabet, the first two lines at the top of an XML file would look like this:

<?xml version="1.0" encoding="ASCII"?>
<?oxygen RNGSchema="" type="xml"?>

If you want to encode a text written in the Cyrillic alphabet, you would have to replace the word "ASCII" in the first line with the code for the Windows character set for Cyrillic, "Windows-1251":

<?xml version="1.0" encoding="Windows-1251"?>
<?oxygen RNGSchema="" type="xml"?>

For texts written in the Hebrew alphabet, the character set would be "Windows-1255."

Other languages and alphabets use the following character sets:

  • Arabic: Windows-1256
  • Greek: Windows-1253
  • Turkish: Windows-1254
  • Baltic languages: Windows-1257
  • Vietnamese: Windows-1258
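Following the same pattern as the Cyrillic example, the first two lines of a file containing a Greek translation would presumably look like this (a sketch based on the list above, not a tested project template):

```xml
<?xml version="1.0" encoding="Windows-1253"?>
<?oxygen RNGSchema="" type="xml"?>
```

The declaration only describes the file; it does not convert it. The file must also be saved in the declared character set by your editor, or the characters will be misread.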

Additional Resources for XML Encoding

  • "A Gentle Introduction to XML," from the TEI Guidelines.