Week Three Reading Blog: The Pitfalls of Digitally Converted Sources

If one message emanated loudly from this week’s readings on digitization, it was: “Be careful!”  Each author consistently offered a cautionary tale about the potential pitfalls of digitizing analog sources for use by historians and other scholars while simultaneously (but somewhat hesitantly) conveying enthusiasm for the vast opportunities offered by such digitization efforts. Even though Dan Cohen and Roy Rosenzweig proclaimed rather proudly that the “past was analog” and the “future is digital” (Cohen and Rosenzweig, 80), they readily admitted that the digitization path they advocated was truly ‘undiscovered country’ fraught with potential problems and unintended effects.  For me, the digitization of sources, particularly newspapers, represents a windfall in accessibility; but, as Ian Milligan cautioned, historians must know the strengths and limitations of the digitized sources they are using, must be transparent about these sources’ digital nature, and must understand how they were constructed from their analog source (Milligan, 566-567).

A great deal of skepticism surrounded one technique in particular: Optical Character Recognition (OCR). Issues ranging from accuracy and cost to digital portrayal seem to mark OCR as perhaps a useful tool for some very specific purposes (such as data mining) but not as a substitute for the source itself. I agree fully. My limited exposure to OCR-digitized newspapers in ProQuest’s Historical Newspaper Database suggests that although OCR-captured works may assist in locating broad trends with regard to selected words and topics, there is no substitute for the historian viewing the original newspaper as it physically existed for its contemporary readers. This aspect of how people experienced the news is important to me because my dissertation concerns how radio and newspaper media constructed a collective American memory of D-Day (6 June 1944) at the time the event was occurring. What excites me about OCR-digitized newspapers is not only the availability I mentioned earlier but also the ability to run keyword searches across larger corpora of selected American newspapers to gauge the degree to which those papers discussed or referenced the coming invasion in the six months leading up to 6 June 1944. In fact, my brief tinkering with data mining in ProQuest’s newspaper archive turned up what may be a new feature (at least since some of the readings were published) that steered me toward an actual facsimile scan of the newspaper articles based on “hits” from my search. Unfortunately, the articles were isolated from the paper’s original layout, so I could not experience the text in the same context as the contemporary reading audience.

In the main, I see great possibilities for digitization, as long as digitized sources supplement other ways that historians locate and interact with source material. As someone who has spent long hours in the National Archives pulling from the shelves box after musty box of World War II documents, I have always found great value in seeing and holding original documents with all their inherent imperfections and unique graphics. As Simon Tanner pointed out through numerous examples, OCR doesn’t do a good job of capturing text from complex layouts that use graphics and other images. Newspapers fall into this category, and seeing original papers and their often varying photographic quality (and even use of unbleached pulp paper) can tell us a lot about how people might remember what the paper was reporting and how the images shaped their memory. I know that Sarah Werner facetiously characterized this approach as “nostalgic fetishizing,” but I think this type of historical research still has a place in the world of digital sources, especially since scholars like Marlene Manoff are starting to see electronic objects as material objects themselves (Manoff, 312). The danger here is that some electronic permutation of a document or photograph, enhanced or refashioned, becomes an actual material substitute for the original analog version. That concept gives me pause and some reason for concern. Will historians find themselves relying almost exclusively on digitally enhanced sources that no longer exist in the form in which they were created? Isn’t that notion the very idea behind the term “ahistorical”?

In any case, and despite my long-standing penchant for using original analog documents as sources, I am most excited about the possibilities of accessing online a broader range of material through digitization. The seven or eight major newspapers now fully digitized and in ProQuest’s database are a great source for me to test out some quantitative theories about how heavily D-Day weighed on the public mind in the months leading up to the invasion and how the media might have portrayed the invasion as an “America-only show.” However, I am painfully conscious of the fact that the corpora of many other newspapers are limited and, in some cases, skewed toward certain populations. For example, I am still on the hunt for good representative samplings of newspapers that targeted specific audiences, such as African-Americans. These newspapers’ perspectives on D-Day are highly relevant to my research. Personally, I have no problem digging for those tough-to-find sources in hard-copy archives or wherever they may be, but I share Ian Milligan’s fear that lazy historians won’t do the leg work needed to dig up obscure sources and will instead rely solely on what is available digitally. I found Milligan’s contention that digitized Canadian newspapers were figuring so predominantly as sources in recent dissertations to be a bit shocking. His point dredged up for me fears that digitized sources, and only digitized sources, might represent for many up-and-coming historians the left and right boundaries of their research efforts. Milligan’s concern runs counter to Sean Takats’s fear that an abundance of digital sources is the real danger. Frankly, I think what gets digitized and when will be the biggest problem. Abundance probably won’t be the real issue. In my own work in years past, I reveled in finding some previously undiscovered source (that “gem”) that added a unique insight or, dare I say it, “flavor” to the history I was writing. I would hate to see future historians discouraged from doing the necessary leg work to find those obscure sources that will build upon existing historical arguments in rich and informative ways.

Ultimately, digital sources will be (and are now) a boon to all historians, professional and amateur alike. The challenge will not be to identify what Bob Nicholson called “the digital turn” (we’re already there) but to learn the strengths and weaknesses of the digital sources we use and to employ them accordingly in the service of our historical arguments. But a digitized source should be only one type of source, not the only source upon which we as historians rely. We still need to dig in those musty old archives to find those undiscovered “gems” awaiting the light of day.

Steve Rusiecki


