Title: "What I Assume You Shall Assume": The Whitman Archive and the Challenge of Integrating Different Open Standards

Author(s): Brett Barney and Kenneth M. Price

Publication information: First published on the Whitman Archive. This paper was originally delivered at the 2004 Modern Language Association Convention.

Whitman Archive ID: anc.00006

The Walt Whitman Archive, begun in 1995, is dedicated to the creation of a vast electronic scholarly resource that will eventually include in one online site the full range of work by and about the renowned poet of democracy. Judging by the proliferation of references to Whitman in popular culture and the explosion of criticism since 1990 (over 120 books and well over 1,100 articles have been published about Whitman in the last 15 years), it seems fair to say that interest in Whitman has never been more intense or more varied. Yet all of this scholarly work is based on an incomplete textual record. Even fundamental information about Whitman's work, such as a record of the drafts and notes that led to his great poem "Song of Myself," is still inaccessible. The material that the Whitman Archive is bringing together will allow for and in some cases necessitate a re-examination of what have been considered safe assumptions about his work.

One way to understand the goals of the Archive is to compare our efforts (in the digital age) to the most comprehensive print edition of Whitman's works: The Collected Writings of Walt Whitman. The Archive is a thematic research collection, one of a new genre that is electronic, interdisciplinary, multimedia, and thematically coherent. The Collected Writings was an NEH-funded project begun in the mid-1950s with the goal of compiling all of Whitman's writings in an "absolutely 'complete'" edition that was to include all of the published volumes of poetry and prose, along with his correspondence, notebooks, daybooks, manuscripts, journalism, and uncollected poetry and fiction. The Collected Writings now consists of twenty-two volumes published by New York University Press, two additional volumes published by Peter Lang, and another volume published by the University of Iowa Press. It includes six volumes of correspondence, six volumes of notebooks and unpublished prose manuscripts, three volumes of daybooks and notebooks, two volumes of published prose, one volume of early poetry and fiction, a three-volume variorum of the printed poems, and a one-volume reader's edition of the poetry. But despite four decades of energetic work by a team of eight scholars and the support of an additional six scholars on the editorial board, many of the original goals of the Collected Writings remain unrealized. For example, because of delays in preparing the manuscript of the projected six volumes of journalism, only in the last few years have the first two volumes appeared, issued by a different publisher, Peter Lang, and it seems very unlikely that the remaining volumes will ever be published. The original five-volume edition of Whitman's correspondence, arranged chronologically, contains an "Addenda," and two supplemental volumes were issued later, rendering the letters forever out of order. And now, only a few months after the seventh volume appeared, several more letters have already surfaced. The variorum edition of Leaves of Grass, which was slated to gather manuscripts, periodical publications, and book publications, ended up dealing only with the published books, a decision that has meant that vital documents regarding the poems' early evolution are still inaccessible.

Similar problems plague the other volumes of the Collected Writings. New materials surface; errors are discovered; scholars make new discoveries; yet the print edition remains frozen, becoming increasingly out-of-date, incomplete, and unreliable. The Walt Whitman Archive, because it is a digital project, has the potential to remedy some of the main shortcomings of the Collected Writings. The many materials that were not included in the Collected Writings are now housed in numerous and scattered archives, and so one of our major efforts has been to gather digital images of those materials and electronically edit them. For example, whereas the Collected Writings printed in full only the final edition of Leaves of Grass (the so-called 'deathbed' edition), our site currently presents four editions, with an additional two in progress. Current contents also include all 130 known photographs of Whitman, with scholarly annotations; a searchable annotated bibliography of Whitman scholarship from the last thirty years; an integrated item-level finding guide to his poetry manuscripts (held at approximately thirty repositories); and full transcriptions and high-quality digital images of nearly one hundred of the poetry manuscripts. (Approximately 250 other poetry manuscripts are in various stages of editing and encoding before receiving final vetting and public presentation.) We have begun editing the 150 poems Whitman first published in periodicals, and we are in the process of making available the nine volumes of conversations with Whitman that Horace Traubel collected in his With Walt Whitman in Camden; the first volume of that work came online earlier this month. None of the material in these examples was included in the Collected Writings.

Large-scale digital thematic research collections such as the Whitman Archive (as well as the William Blake, Dante Gabriel Rossetti, and Valley of the Shadow projects) serve the unique needs of scholars. Because of that, they differ in some important respects from other large-scale text digitization projects that are library-based. Too often, talk of interoperability has been conducted within a single community: for example, focusing on the use of a standard for data capture, or focusing on the sharing of metadata among libraries or among software and hardware platforms within libraries. And there has been too little cross-community outreach from libraries to scholars and from scholars to libraries. So even though scholars who create digital thematic collections must necessarily work with libraries—both as a source of primary materials and as the most likely destination of their work for long-term preservation and access—they are generally unfamiliar with the reasons why such collections should be built to conform to certain standards, and they lack the know-how that would allow them to implement those standards. Libraries, on the other hand, have largely driven the definition and development of digitization standards, but they haven't typically involved scholars in that development or used scholar-produced sites as test-beds. So closer collaboration will be required for complex and comprehensive thematic collections to be collected by libraries (or published by publishers). The problem we're describing is one of separated cultures: the divisions of labor that were acceptable and habitual in a print environment can create problems in a digital environment.

We believe that the sustainability of projects like the Whitman Archive depends upon compatibility with existing technologies and availability to future technologies and that those goals are best met through standardization of data formats (e.g., TEI and TIFF), of metadata formats (e.g., EAD and MODS), and of the techniques and methods used to control the files and the interrelations among them (METS). To quickly clarify for any non-specialists in attendance, we'll gloss some of the acronyms that are in play here: TEI is the Text Encoding Initiative, which is a widely adopted XML standard for encoding humanities texts; TIFF is a high-quality image file format; EAD stands for Encoded Archival Description, an XML tagset for creating archival finding guides; MODS is Metadata Object Description Schema, a set of XML elements for encoding bibliographic information; and METS stands for Metadata Encoding and Transmission Standard, an XML tagset for identifying relationships among related digital files. The Whitman Archive is a project committed to building its digital resources in ways that are compliant with open standards at all levels, and we are either currently implementing or testing the implementation of each of these standards. Our experiences illustrate some of the challenges involved in bringing them together and deploying them in ways that are suitable to the special requirements of digital thematic research collections. These challenges include: treatment of materials which are physically dispersed; support for the very detailed encoding of data and metadata that a scholarly edition requires; and the tight integration of the images, transcriptions, finding aids, and administrative records created to carry that data.
To return to a point we made earlier: it is crucial that creators of digital thematic research collections work in a way that is compatible with library and archive data standards because 1) the primary resources are held by archives and libraries, and many of the repositories are digitizing materials and making them available to scholars, even if the digital representations are frequently not as rich as those that the scholars will eventually create; 2) archives and libraries will, ultimately, be responsible for the collections that are created; and 3) if publishers are to have a role in the publication of digital thematic research collections, standards will be essential for portability and aggregation—and these standards will by and large come out of the library world. In the current situation, there is a separation of the librarians and archivists who are involved in the creation of standards from the scholars who, increasingly, need to know about the standards and help refine and advance them. We see several problems that this situation poses for the future of digital scholarly editions: 1) Projects are at great risk of floundering or of proceeding in idiosyncratic ways; 2) digital librarians, working to develop or install access systems for these materials, have no basis for reliable interoperability; and 3) standards organizations lack the information they need to guide their own strategies and future efforts.

Yet, if the obstacles can be overcome, digital projects that reunite archival materials found in multiple repositories offer possibilities (both for public presentation and for scholarship) that were unimaginable in the recent past. As we noted, Whitman's poetry manuscripts are scattered among many geographically dispersed repositories, and some types of research—for example, comparing drafts and establishing the history of composition—were really not possible until we could bring together high-quality color images with richly encoded transcriptions of all of the documents. Users stand to benefit greatly when EAD, TEI, and MODS files work interactively alongside high-quality images of archival material. In a single place, a user can locate items in a repository; situate documents in their intellectual and archival contexts; analyze facsimile images of the documents themselves; and manipulate the content in various ways, including, for example, reusing the material in a scholarly argument of the user's own making. The Whitman Archive illustrates some of these possibilities. The next slides show screenshots of steps a user might take in examining documents in the Archive. Here, using the integrated finding guide to poetry manuscripts, a user can locate all of the manuscripts that contribute to a given poem, look at transcriptions and scholarly commentary for each of the manuscripts, and view page images. Or they might go the other direction, starting out from a particular manuscript transcription, and use the annotation to locate in the integrated finding guide all related manuscripts, and then link to the individual repositories' finding aids to gather information about provenance, etc.

The potential power of standards-based digital thematic projects has not been fully realized because those of us working in humanities computing have yet to figure out how best to make the relevant standards interoperate. The Whitman Archive has made great progress in the use of TEI and EAD, and we have recently begun to explore using METS to help manage the integration of EAD and TEI encoding. One basic challenge is to figure out what role each standard is to have, and how those roles interrelate. For example, descriptive metadata—details about the intellectual content of an item—resides in EAD (descriptive metadata being a primary purpose of EAD), in TEI headers (where it serves a secondary purpose), and in METS. Should redundancies be eliminated, and if so, in favor of which standard? If not, how should they be mapped to one another and managed to guarantee agreement?
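
To make the agreement problem concrete, here is a minimal sketch in Python of one way to detect when a title recorded in a TEI header and the title recorded for the same item in an EAD finding guide have drifted apart. It is an illustration under stated assumptions, not a description of the Archive's tooling: the two sample records are invented and stripped down (they are not valid TEI or EAD instances), the shared id value that links them is a hypothetical convention, and the function names are our own. Only the element names (titleStmt, title, c01, did, unittitle) come from the standards themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down sample records invented for illustration.
# Neither is a complete or valid TEI or EAD instance.
TEI_SAMPLE = """<TEI.2 id="ms.0001">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample   draft title</title></titleStmt>
    </fileDesc>
  </teiHeader>
</TEI.2>"""

EAD_SAMPLE = """<ead>
  <archdesc level="collection">
    <dsc>
      <c01 level="item" id="ms.0001">
        <did><unittitle>Sample draft title</unittitle></did>
      </c01>
      <c01 level="item" id="ms.0002">
        <did><unittitle>Another title</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def normalize(text):
    """Collapse whitespace and case so trivial differences are ignored."""
    return " ".join((text or "").split()).lower()


def tei_title(tei_xml):
    """Return (item id, title) from the TEI header of one transcription."""
    root = ET.fromstring(tei_xml)
    return root.get("id"), root.findtext(".//titleStmt/title")


def ead_titles(ead_xml):
    """Map each item-level component id in the finding guide to its title."""
    root = ET.fromstring(ead_xml)
    return {c.get("id"): c.findtext("did/unittitle") for c in root.iter("c01")}


def check_agreement(tei_xml, ead_xml):
    """Report whether the two descriptions of the same item agree."""
    item_id, title = tei_title(tei_xml)
    guide = ead_titles(ead_xml)
    if item_id not in guide:
        return "missing"
    same = normalize(title) == normalize(guide[item_id])
    return "agree" if same else "conflict"


if __name__ == "__main__":
    print(check_agreement(TEI_SAMPLE, EAD_SAMPLE))
```

A check of this kind leaves the redundancy in place and only reports disagreement; it does not settle which standard should be treated as authoritative, which is the harder question.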

These are real issues, ones that reasonable persons can disagree about, as is demonstrated by the ways they have been handled in various other projects. One of those is LEADERS (Linking EAD with Electronically Retrievable Sources), a project based at University College London and funded by the Arts and Humanities Research Board. Using a set of George Orwell documents and university records in their archives, they have developed one approach to integrating TEI and EAD, a model which expands the EAD header and removes metadata from the TEI file. Their approach, while it has much of interest and value, uses EAD for purposes it was not designed to meet and separates metadata from the data it describes, thus complicating any possible future aggregation with other data sets. Our own work is different from LEADERS in a couple of important ways. First, it adheres closely to accepted practice in the application of EAD. That's crucial because we are dealing with about thirty different collections, which are dispersed around the US and Europe. The LEADERS project, in contrast, deals only with materials at their home institution. So, unlike LEADERS, we are working collaboratively with other institutions and pulling in their EAD work and using it in our own larger aggregation. In the course of our work we are creating something new (the integrated finding guide), not digitizing something that already exists in another form.

Another important difference is our commitment to METS and its extension schemas (e.g., NISO-MIX, an image metadata schema, and MODS). Besides its potential to manage integration of metadata among the various kinds of files on our site, we plan to explore its use as a submission tool—as a standard for "packaging" the Archive for digital libraries. So far, we have only done preliminary work in this area, but the increasing adoption of METS in the TEI and EAD communities makes further research into its application to thematic research collections compelling. Several important projects are testing, developing, and implementing METS in systems for submitting and retrieving disparate components of libraries' and archives' digital resources. However, its applicability to collections like the Whitman Archive will require scholars to participate in further testing and development efforts. We hope to receive grant funding to develop a model METS profile for thematic research collections and to test the use of METS for submission to two different digital library systems: FEDORA (at University of Virginia) and a more highly generalized system (at Brown University). Furthermore, we plan to share the results of our work with the MLA Committee on Scholarly Editions (CSE), the ADE, the National Historical Publications and Records Commission (NHPRC), and other humanities professional organizations.
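
As a rough indication of what "packaging" might involve, the following Python sketch assembles a skeletal METS document for a single manuscript: a pointer to the finding guide as external descriptive metadata, a file section listing a transcription and a page image, and a structural map tying them together. It is a sketch under assumptions, not a model profile: the function name, the object identifier, the file names, and the USE and TYPE labels are all hypothetical, and a real submission package would carry a good deal more (a header, administrative and technical metadata, and so on). The element and attribute names and the two namespace URIs are those of METS and XLink.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_package(object_id, ead_href, files):
    """Build a skeletal METS tree for one item.

    files is a list of (file_id, use, mimetype, href) tuples; every value
    passed in here is the caller's own (hypothetical) naming scheme.
    """
    mets = ET.Element(q("mets"), {"OBJID": object_id})

    # Point to the finding guide as external descriptive metadata.
    dmd = ET.SubElement(mets, q("dmdSec"), {"ID": "dmd1"})
    ET.SubElement(dmd, q("mdRef"), {
        "LOCTYPE": "URL",
        "MDTYPE": "EAD",
        f"{{{XLINK_NS}}}href": ead_href,
    })

    file_sec = ET.SubElement(mets, q("fileSec"))
    struct_map = ET.SubElement(mets, q("structMap"), {"TYPE": "logical"})
    item_div = ET.SubElement(struct_map, q("div"),
                             {"TYPE": "manuscript", "DMDID": "dmd1"})

    groups = {}
    for file_id, use, mimetype, href in files:
        # One file group per kind of use (e.g. transcription, facsimile).
        if use not in groups:
            groups[use] = ET.SubElement(file_sec, q("fileGrp"), {"USE": use})
        entry = ET.SubElement(groups[use], q("file"),
                              {"ID": file_id, "MIMETYPE": mimetype})
        ET.SubElement(entry, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map refers back to each file by its ID.
        ET.SubElement(item_div, q("fptr"), {"FILEID": file_id})
    return mets


if __name__ == "__main__":
    package = build_package("ms.0001", "finding-guide.xml", [
        ("tei1", "transcription", "text/xml", "ms0001.xml"),
        ("img1", "facsimile", "image/tiff", "ms0001.tif"),
    ])
    print(ET.tostring(package, encoding="unicode"))
```

The point of the exercise is that the relationships among transcription, image, and finding guide are stated once, in a form a digital library system could read, rather than being implied by file names or site navigation.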

By doing so, we will be helping to build the digital and intellectual bridges that make future work on thematic research collections more efficient and sustainable. In our past and present work on the Whitman Archive we have begun wrestling with the problem of integrating standards developed by different communities working largely in isolation from one another. Further work on the Whitman Archive will, we hope, develop solutions that will serve as a model for other scholarly digital projects and inform the future development of the standards.


Distributed under a Creative Commons License.