Monthly Archives: September 2014

Week Five Reading Questions: Text Mining and Topic Modeling

1.  Describe text mining in the context of the readings. What are its possibilities for historians? What are its pitfalls?

2.  Frederick W. Gibbs and Daniel J. Cohen believe that text mining is more relevant to open-ended questions, in which “the results of queries should be seen as signposts toward further exploration rather than conclusive evidence” (Gibbs and Cohen, 74).  Explain what the authors mean by this statement.

3. Ted Underwood contends that historians must overcome two obstacles before engaging in text mining: (1) getting the data you need, and (2) getting the digital skills you need. What digital skills does Underwood feel that historians should develop?

4.  According to Cameron Blevins, literary scholar Franco Moretti developed the digital method of “distant reading.”  Describe the concept of distant reading. How is distant reading different from text mining?  How is distant reading useful for historians?

5.  Cameron Blevins argues that the promise of digital history is “to radically expand our ability to access and draw meaning from the historical record” (Blevins, 146).  Do you agree? What other possibilities might Blevins be overlooking?

6. What is “topic modeling”? How does it relate to text mining and distant reading? How is it useful for historians?

7. According to Ted Underwood, an Internet search is a form of data mining, but it is only useful if you already know what you expect to find. Do you agree? What is Underwood’s remedy for seeking the unknown and the unexpected in the digital record?

Week Six Reading Blog: Visualizing the Past Is a Great Thing

Like many people, I tend to respond positively to visualizations of ideas, concepts, and topics. For me, history is a very visual enterprise, and I have never been able to imagine producing history without some accompanying visual representations to enhance the prose and the ideas I set forth. For the books I’ve published on World War II topics, photographs and maps have always been my visual media of choice. In fact, I still get my hackles up thinking about the “battle royals” I had with my publishers over how many photographs and maps I could include in my books. They always low-balled me from the outset, so I had to scrape for every additional photograph or map. Alas, publishing is capitalism at its finest: the more visual representations in the book, the more these additions affected the publisher’s bottom line. The scars I suffered in those battles are still with me, and I’ve never felt that my works were as complete as they could have been without the full complement of photographs and maps I intended to use. But now, with the emergence of digital history and new ways of providing digitized visualizations beyond simply maps and photographs, those “battle royals” may be a thing of the past. Hard copies may still be limited in the graphics and photographs they can include due to cost considerations, but electronic Kindle editions and the use of Web sites to supplement published hard-copy works certainly offer a more feasible and cost-effective approach to extensive visual portrayals of the past.

Although maps and photographs have been my traditional “visualizations of choice,” I have always been drawn to histograms, tables, and other graphs as possible ways to portray selected bits of information visually, particularly when depicting significant changes over time or quantifying particular assertions. Yet developing such quantitative representations has always proved daunting for me, principally because I believed that they failed to capture the ‘fuzziness’ of some interpretations appropriately. For that long-standing reason, Johanna Drucker’s article about the subjective, interpretive nature of certain data — which she labels “capta” — made perfect sense to me. My past attempts to employ otherwise quantitative tools to portray subjective data, or capta, have always fallen flat. I worried that graphic portrayals that could not account for the ambiguity inherent in interpreted data might suggest an attempt by me to engage in “quantitative manipulation” or some other form of fallacious reasoning to make my point. Frankly, I’ve always been skeptical of statistics and other data, since they are prone to such easy manipulation. And, for that reason, complicated graphs and tables have been something that I’ve simply “jumped over” in my readings of historical monographs. Most of the time, figuring out what those visualizations were trying to say proved too aggravating and time-consuming. Few of them met John Theibault’s standard of being “transparent, accurate, and information rich.”

But the examples provided by Drucker in two of her article’s figures, Figures 2 and 4, really worked for me. They showed quite readily, as she intended, an alternative way of portraying data (or capta) that was, in her words, “taken” and not “given”; they depicted the “fuzziness” inherent in the capta — the external and internal factors that shaped and re-shaped that information — without scuttling the larger point being made. Clearly, Drucker has made both a philosophical and a practical point of the highest order: historians must — and I mean “must” — find visual techniques to present capta in a way that represents the humanistic methods that generated it, methods that, as Drucker claims, “are counter to the idea of reliably repeatable experiments or standard metrics that assume observer independent phenomena” (Drucker, numbered paragraph 13). In fact, Lauren F. Klein, in analyzing Jefferson’s correspondence for “breadcrumbs” about James Hemings, offers her research as proof of Drucker’s broader assertion that graphic techniques suited to the empirical sciences can mask the subjective biases of historically interpreted capta. I agree, and I can’t wait to experiment with visualization techniques that will subordinate the quantitative to the qualitative.

Many of the sample visualizations that John Theibault included in his article stirred my imagination and opened up many possibilities for effective data-visualization techniques. Density maps in particular demonstrated both transparency and meaning; they communicated to me, through colorized graphics overlaid on actual geographical representations (like the entire United States), the ability to track specific events over space and time. This approach intrigued me most because of the possibility of using similar maps (albeit with animation) to portray the movement of specific units engaged in, say, a World War II battle. In fact, the idea of recreating maps from my two books in such a format fascinated me; by posting them on the Web as supplements to my hard-copy books, readers could follow troop movements and engagements in real time over actual maps of the terrain, creating a visual narrative that would not only complement but perhaps significantly enhance a reader’s understanding of a battle’s flow and the inherent friction in war. As Theibault rightly opined, “Animation increases [the visualization’s] interpretive force dramatically.” Oddly enough, back in the late 1990s, the U.S. Army tried to use a similar visualization tool with computers to track combat forces in real time over actual terrain as a battle was unfolding. The tool did not work exactly as planned but instead morphed into something better that the Army uses today. Theibault’s visual examples remind me of those early battle-mapping graphics and of how quickly such things can develop into other, more effective tools over time.

The visualizations that are least effective for me are the ones built from nodes, “edges,” and abstract layouts — basically network graphs. I prefer to see visual information grounded contextually in something familiar, like a map or a basic graph. The graph that Elena M. Friot generated with the Gephi program simply does not work for me. And I’m inclined to agree with Scott Weingart’s assertion that “network structures are deceitful,” primarily because they rely so heavily on adhering to specific input rules. If you don’t enter your values (or whatever) properly, God only knows what will come out at the other end — probably the Frankenstein’s monster of all graphic portrayals, a hodgepodge of complexity and confusion. But visualizations are a good thing — a very good thing — and I want to engage in more opportunities to use them in my work.

Steve Rusiecki


Week Four Practicum Blog: Will Digital Sources Scuttle a Journal Article?

My survey of the main articles published in the last three years of the Journal of the Civil War Era suggests that until precisely one year ago, scholars submitting articles for publication avoided citing Web sites or any other digital sources. Could it be that citing such sources, or even using a database, in this particular journal was the kiss of death until this past year? These results intrigued me, so I carefully surveyed the footnotes in each journal article, beginning with Volume 4, Number 3 (September 2014) and working backward to Volume 2, Number 1 (March 2012). Overall, the dearth of digital sources, or at least the admission by an author that he or she used a Web site to locate a cited source, was striking, particularly since we know with some certainty that many historians routinely shop for source material on the Web, especially for journal articles.

Each edition of the Journal of the Civil War Era carried an average of three articles in addition to standard features such as editorials and book reviews. I focused solely on the articles and visually scanned each footnote section for any hint of a database in use or the mention of a Web site. The most recent edition, published this month (September 2014), yielded no hits. But with the June 2014 issue, I seemed to hit the jackpot: an article by Chandra Manning, whose monograph What This Cruel War Was Over (2008) just happens to sit near the top of my personal hierarchy of masterful scholarship. In this particular article, Manning cited Web-based or digitized sources in six separate footnotes. Yet each Web reference said nothing about any specific way of using the site other than as a repository for some specific information. In one case, she argued quite pointedly that “antebellum state constitutional conventions demonstrated that a citizen was someone seen by others in the community as independent, self-reliant, and capable of contributing to ‘the harmony, well-being, and prosperity of the community.’” Her source for this assertion was a database titled “Debates and Proceedings of the Convention for the Revision of the Constitution of the State of Indiana, 1850” (http://indiamond6.ulib.iupui.edu/cdm4/document.php?CISOROOT=/ISC&CISOPTR=6357&REC=10). She even offered up in the same footnote a database of state constitutions titled “The NBER / Maryland State Constitutions” (http://www.stateconstitutions.umd.edu/index.aspx) in case anyone wanted further examples. Unfortunately, she provided no insight into how she used these databases other than as repositories for specific historical documents. Given the methodological transparency of her 2008 monograph, I was surprised that she was not more specific in describing how she used these digitized sources.

My sense of “hitting the jackpot” diminished quickly as I continued my backward trek through the journals. Manning would turn out to be the most prolific “citer” of digital sources whom I would encounter in the practicum. In the March 2014 issue, three authors cited only one digital source each to support a specific assertion. For example, Nicholas Marshall relied on a digital database titled “Dyer, A Compendium” (http://www.civilwar.net/searchstates.asp?searchstates=Ohio, http://www.civil-war.net/searchstates.asp?searchstates=Massachusetts) to distinguish war-related deaths from those that occurred due to illness or accidents. In that same issue, Sarah Bischoff Paulus used a database named “America’s Historical Newspapers” (http://www.readex.com/content/america%E2%80%99s-historical-newspapers-college-edition-1690-1922) to locate April editions for the years 1854, 1855, and 1856 in order to illustrate the admiration many Americans felt for the late Henry Clay through news reports of annual observances of his birthday. At least in Paulus’s case, her use of the database seemed obvious: she searched for newspapers published in mid-April of a given year and then selected the editions that mentioned celebrations of the late Henry Clay’s birthday.

Only a couple of articles in each of the September and December 2013 editions yielded any evidence of digital sources in the footnotes. In her article in the September issue, Beth Barton Schweiger queried the “UNESCO Institute for Statistics” database (http://stats.uis.unesco.org/unesco/TableViewer/document.aspx?ReportId=121&IF_Language=eng&BR_Country=7160&BR_Region=40540) for the 2010 adult literacy rate in Zimbabwe (92.2 percent) to contend that high literacy rates did not necessarily equate to economic prosperity. In her article in the December issue, Thavolia Glymph used the “National Register Properties in South Carolina” database (http://www.nationalregister.sc.gov/berkeley/S10817708014/index.htm) to illustrate the historical significance of an antebellum inland planter-class settlement in South Carolina named Pineville Village. In both cases, the authors neither elaborated on their search techniques nor explained how they manipulated the databases they cited. My own impression was that they simply used these databases as encyclopedic repositories for information they could obtain from a basic search.

The next five editions of the journal represented a remarkable dry spell in digital source citations. Not a single article mentioned any type of digital source, even though several articles relied upon newspapers that could only have come from an online database search. The final digital entry I located came from the very last edition in my survey population, the March 2012 issue. In that edition, Matthew C. Hulbert used the Missouri Division’s “Sons of Confederate Veterans” Web site (http://www.missouridivision-scv.org/littledixie.htm) to argue for Missouri’s cultural link to the Old South and the Confederate cause. Once again, in the absence of any further explanation, I can only assume that he accessed the database strictly to see the information posted there.

After this detailed survey of three complete volumes (12 editions) of the Journal of the Civil War Era, one message echoed very loudly: citing digital sources in an article published in this particular journal was not (until very recently) a widely accepted — or perhaps tolerated — practice. More than half of the editions surveyed did not list a single hyperlink in any of the articles’ footnotes, despite citing some sources the authors almost certainly found online. I am hesitant to jump to conclusions based on such limited evidence, but this exercise has suggested to me that until very recently (and I mean within the last 12 months), transparency in the use of digital sources has remained elusive in the world of academic historical journals. The Academy seems to be undergoing its own “cultural turn” as digital sources claw their way to relevance — but ever so slowly. Few scholars seem as bold as Chandra Manning in identifying those sources for what they are — digitized products housed in online databases. From my perspective, digital sources are fully acceptable as long as the methodology remains clear. The potential pitfalls I have already witnessed in my brief exposure to digital history suggest that we can’t accept all digital products at face value. We have to know the strengths and weaknesses of our digitized sources and explain to our readers how we compensated for those factors when we employed these sources in the service of our arguments. That standing theme of cautious optimism when using digital sources dominates the existing landscape of digital history — and for good reason.

Steve Rusiecki


Week Five Reading Blog: The Power of Text Mining and Topic Modeling

Since this course began, I have found myself reading ahead in an effort to find those digital tools that might prove most useful to me in my research. The two techniques that excited me the most were text mining and topic modeling, each of which suggested ways of negotiating large swaths of newspapers related to D-Day and the months leading up to the invasion without spending days and possibly months sifting through irrelevant online or on-site archives. With these tools, I could compile the relevant sources quickly, but I also recognized that these mining and modeling efforts could not supplant my need to read every relevant primary source in its original form — as an OCR image or otherwise — in order to evaluate its content and context properly. I was equally excited to learn that these tools could spit out graphs and histograms that could add a visual impact to any assertions I might make from a quantitative perspective. Once again, however, I am mindful of the strong thematic thread running through this week’s readings and the readings from other weeks: know the shortcomings of these tools and proceed with caution.

The assertion made by Frederick W. Gibbs and Daniel J. Cohen that a tool like text mining can open gateways to “further exploration rather than [serving as] conclusive evidence” resonates strongly with me (Gibbs and Cohen, 74). Like these two historians, I don’t see the results of a text or topic search standing alone as evidence. In some cases, the quantification of hits can help make broader points, such as Cameron Blevins’s foray into defining regional spatiality in certain newspapers. His extensive text-mining efforts allowed him to argue that newspapers “privilege . . . certain places over others,” thus creating an “imagined geography” for their readers (Blevins, 124). The visual mapping he generated proved equally impressive and useful in driving home his point. But Blevins was careful to point out that the traditional skills of the historian — that is, close reading and contextual source evaluation — have never been more necessary as a way to avoid falling into the superficiality trap. In my circumstances, text mining will prove useful for determining the degree to which newspapers featured articles on key Allied leaders leading up to the invasion or, once the invasion began on 6 June 1944, privileged coverage of the Allied efforts in the West over those of the Soviets in the East. The benefit of these searches is two-fold: quantification on one hand and, on the other, the ability to read closely for context and meaning only those editions that yielded “hits.” But I also agree with Ted Underwood that a basic search, which is how he seems to define text mining, will generally give you what you probably expected to find. But what about finding those unknown “gems”? I think the second tool, topic modeling, will help me in this endeavor.
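To make that two-fold benefit concrete, here is a minimal Python sketch of the kind of frequency count I have in mind, assuming each edition’s OCR text sits in its own plain-text file. The directory, the filenames, and the list of leaders are hypothetical placeholders, not a method drawn from the readings:

```python
# Minimal text-mining sketch: tally how often each Allied leader's name
# appears across OCR'd newspaper editions. The directory, filenames, and
# leader list are hypothetical placeholders.
import re
from collections import Counter
from pathlib import Path

LEADERS = ["Eisenhower", "Montgomery", "Bradley", "Ramsay"]  # hypothetical

def count_mentions(corpus_dir: str) -> Counter:
    """Count name occurrences across every .txt edition in a directory."""
    tally = Counter()
    for edition in Path(corpus_dir).glob("*.txt"):
        text = edition.read_text(errors="ignore")
        for name in LEADERS:
            # \b word boundaries keep "Bradley" from matching inside
            # longer strings produced by messy OCR
            tally[name] += len(re.findall(rf"\b{name}\b", text))
    return tally

if __name__ == "__main__":
    print(count_mentions("newspapers/early-1944"))  # hypothetical path
```

The output is just a tally per name, which is exactly the point: the numbers would flag which editions deserve close reading rather than serve as conclusive evidence in themselves.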

I found Robert K. Nelson’s “Mining the Dispatch” to be a remarkably effective discussion of the possibilities, and potential pitfalls, of topic modeling, which he defines as a “probabilistic technique to uncover categories and discover patterns in and among texts.” What excites me about the prospects of this technique is that the pairing of words as a way to search for macro-patterns within specific topic areas can lead to new discoveries. For example, in the context of my research, the editorials addressing specific aspects of D-Day in the months leading up to the invasion cover many, many different topics, most of which could be difficult to narrow down and categorize by hand. Topic modeling seems to offer me a way of focusing more quickly on the numerous themes embedded in opinion columns, editorials, and other articles. But topic modeling also seems to have the most pitfalls. Micki Kaufman suggests that topic modeling requires a specific skill set that most historians lack. If Kaufman is correct, then many historians who struggle to master such a skill (or skills) may find themselves avoiding the tool altogether or grossly misinterpreting their results. Or, as Ted Underwood suggests, if programming skills become necessary because the project’s scope is too great, many historians may opt out. Frankly, after reviewing the graphic portrayals of Kaufman’s analysis of Kissinger’s Memcons and Telcons using the topic-modeling software MALLET, I had difficulty making sense of the images and what they were trying to tell me. They supposedly portrayed meaningful word correlations; but, in the absence of further explication, they simply puzzled me. I guess my mind just doesn’t work that way. Nelson’s approach seemed more in line with what I would consider possible given my own skills, particularly his sorting of topics into clearly labeled, written categories. Topic modeling is one digital tool that I want to try for my project, but I’m a bit apprehensive about what the results may yield and about my ability to interpret them properly.
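To get a feel for the mechanics before wrestling with MALLET itself, here is a rough Python sketch using scikit-learn’s LatentDirichletAllocation as a stand-in. The three tiny “documents” are invented purely for illustration; a real run against 1944 editorials would need far more text, more topics, and careful stop-word handling:

```python
# Rough topic-modeling sketch with scikit-learn's LDA as a stand-in for
# MALLET. The tiny sample corpus is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "invasion channel coast troops landing craft",
    "war bonds rationing production factories homefront",
    "invasion second front allied command channel",
]  # stand-ins for OCR'd editorials

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)  # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words the model associates with each inferred topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```

The appeal, as Nelson’s project suggests, is that the model proposes the word groupings; the historian’s job remains the interpretive one of deciding what each cluster of words actually means.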

Overall, text mining and topic modeling intrigue me greatly. The only thing to do now is put them to use. But some big questions remain for me. How will I apply software such as MALLET and the Google Ngram Viewer to the newspaper databases found on Newspapers.com, Chronicling America, or ProQuest? The more I consider the technical aspects of putting these tools into practical use, the more uneasy I feel. But isn’t getting out of one’s “comfort zone” the path to bigger and better discoveries?

Steve Rusiecki

Week Three Practicum Blog: OCR and My Exercise in Frustration

Good Lord. Until this practicum, I thought I was fairly adept at manipulating computer programs. But to make parts of this exercise work, I felt like the proverbial one-legged man in an ass-kicking contest.  I feel more traumatized now than when I experienced ground combat for the first time. Just kidding — maybe. First, let me address my experience with the Google Drive OCR program.

I will admit that once I was able to open the older version of Google Drive OCR, the program proved rather easy to use. I uploaded my assigned document from Prof. Robertson’s file (number 7) and ran it through the program. The instructions were easy to follow and worked perfectly. But I encountered two significant problems: one was practical computing (how to save the jpeg I created to my desktop), and the other was the nature of the OCR results. First, let me address the source image. As you can see from the image below (sorry, but I could not figure out how to align it in the WordPress program), the typewritten script is very clear. Despite my efforts to crop the image and increase font size, I could not make the typewritten words any clearer than when they appeared in their original form.

[Image: scan of the assigned typewritten document used for the OCR test]

Next came the results. Wow. The digitized text following remediation appeared to be written in Klingon, Pig Latin, or some other unintelligible jib-jab.  I expected a far better result given the clarity of the typewritten page. Here’s what the program gave me:

ma charge of the translation. I admit Burns r0001″! money. I admit ho return! some monoy, and 1t Isl returned been“. by all and”.1 30 air, I didn’t.

A No sir. I think I got my information from tho audit department. I could explain to you how tint cam.

HR. BOARDMAR: is hero and be tho but proof of that. I think I will sustain the objection.

Q In your discussion with reference to thin lunatigaticn that yen wore making was there anything othor than innltigatm the birth of this 011116?

A Yu, than m the minus back of the pushing of the 0mg” against Thu. That m about all, an! to fin! out who :11 nor. inhrutoa Ln pronoun“; Thaw uni that their mutt?“ wort.

Q D16 you in flat connutlcn make my innitigatlon to “M out in or around I!” York in” how Ml “cap. I” uooonplilhoa and anything in commotion with um and how they In” trying to 301: Mm back?

As you can see, very few sentences proved coherent. In fact, without reading the original primary-source document, I could barely discern the topic of the interrogatory. I even manipulated the text after I imported it into the blog so that it would at least resemble the original in terms of format. The error rate was well above 70 percent in this case, which shocked me. The digitized newspapers archived on the “Chronicling America” Web site fared much better, but I will discuss those results below. An error rate this high suggests that the OCR-generated text is marginally useful at best — and then perhaps only for indexing. Frankly, my confidence in OCR dropped markedly after this exercise. I was unable to compare my results with Andrew’s because he stated that his computer could not support the exercise.

The second problem I encountered with Google Drive OCR was trying to save the jpeg version of my OCR results to my desktop or to some other location that would allow me to import it into my practicum blog. I could neither drag nor copy it to the desktop from Google Drive. In fact, I could not locate a “save as” feature anywhere. Eventually, I copied the image and text results into a Word document and then used that file to import the image and text into WordPress. Once again, I struggled to perform what should have been a couple of routine computer tasks due to my inability to intuit the functions within the program.

In my review of the newspapers in “Chronicling America,” I chose to search for 19th-century American newspapers that mentioned the phrase “Lincoln assassinated.” I set my parameters for 1865 to 1865 (one year only), and I received numerous hits from both Northern and Southern newspapers beginning in April 1865 and later. I carefully reviewed the PDF scans of the original newspapers and the OCR-generated text behind them. Unlike my experience with Google Drive’s OCR results, the error rate was much, much lower. Like the document I used above, the newspaper text was generally clear save for a handful of letters and words that appeared smudged or that did not strike the paper firmly enough during the original printing process. I engaged in some back-of-the-napkin quantitative analysis with a few paragraphs in each newspaper and discovered, after counting both correctly captured words and those that were incorrect, an error rate of roughly 12 incorrect words for every 100 scanned. This 12-percent error rate was much more manageable and gave me greater confidence that the newspapers would yield useful, accurate results for historians engaging in data mining, topic modeling, or indexing. Many of the incorrect words resulted from poor printing quality in the original; in one case, the tightness of the letters in relation to one another changed the word “bouquet” to “bonqnet.”
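For transparency, my napkin math amounts to the tiny Python sketch below: a positional word-by-word comparison against a hand-corrected transcription. A proper word error rate would use edit distance, and the sample strings here are invented, echoing the “bouquet”/“bonqnet” misread above:

```python
# Back-of-the-napkin OCR error rate: compare the OCR output against a
# hand-corrected transcription, word by word. This is a crude positional
# check, not a true edit-distance word error rate.
def ocr_error_rate(truth: str, ocr: str) -> float:
    truth_words = truth.split()
    ocr_words = ocr.split()
    wrong = sum(t != o for t, o in zip(truth_words, ocr_words))
    wrong += abs(len(truth_words) - len(ocr_words))  # missing/extra words
    return wrong / len(truth_words)

# Invented example: one misread word out of five = 20 per 100 scanned
print(ocr_error_rate("a fine bouquet of flowers",
                     "a fine bonqnet of flowers"))  # 0.2
```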

Since most of the primary sources I will use for my dissertation will be U.S. newspapers from June 1944, I looked for an online archive other than “Chronicling America” that digitized newspapers using OCR. Newspapers.com fit the bill, and I signed up for a free seven-day trial so that I could gain full access to the archive. I conducted an initial search for “6 June 1944” and immediately saw scores of digitized front pages from D-Day. The thumbnails made scanning for examples extremely easy. I selected three front pages to test for OCR quality: the Daily Boston Globe, the Las Cruces Sun-News, and the Bakersfield Californian. The scan quality varied significantly from paper to paper, most likely because of the quality of the microfilm from which the newspapers were digitized. For example, the image quality of the Daily Boston Globe was so poor that I was unable to read it even after zooming in as close as the program would allow. By contrast, the other two papers displayed much better digitized quality. After about 30 minutes of effort, I could not figure out how to view the OCR text behind each front page. Therefore, I decided to test the OCR quality by conducting a search for a very prevalent word on each front page: “France.” Even though I could make out numerous uses of the word “France” in the Boston paper, the search highlighted only one “hit,” clearly a result of the digitized original’s poor quality. The Bakersfield and Las Cruces papers fared much, much better. In each case, the search feature “hit” each instance of “France” that I could discern — a perfect record!

In the end, my experience with ProQuest Historical Newspapers and Newspapers.com restored my faith in OCR. I am still not certain why the Google Drive OCR program produced such poor results. Was I simply too inept to use it properly? Should I have somehow tried to improve the source image’s quality? Perhaps some skill is necessary in ensuring the proper digitization of documents using OCR. What is most important to me now is knowing that OCR-digitized newspaper archives exist that will prove both useful and reliable as I press on with my dissertation research. In the meantime, I need to get on the stick and figure out how to use this software properly. I wish we had had more time in class for the practicum so that we could have learned from each other’s mistakes.

Steve Rusiecki


Week Four Reading Blog: Deconstructing the Database’s Perks and Perils

Our previous weeks’ readings consistently reflected both enthusiasm and caution about the new and not-so-new digital tools available to all historians. In keeping with this message, the digital “database” has now become the next digital-history resource that can potentially elevate historical scholarship to a new level — “digital history 2.0,” according to James Mussell — while still retaining the potential to “scald” unwary and neophyte users. But for me, the readings suggested that the digital database, if used properly and for selected purposes, can be a boon to the average historian — even the most digitally challenged of us. Yet in spite of my enthusiasm for the digital database, Lev Manovich’s characterization of the database as the ‘natural enemy’ of narrative resonates strongly with me. My own experience with Edward Ayers’s The Valley of the Shadow Web site gave me reason to believe that Manovich’s point has merit. Thus, the message to me is clear: proceed with caution, and understand the strengths and weaknesses of the database before putting it to use.

The first message I took from the readings was the need to recognize the true nature — both good and bad — of the digital database as described by Manovich. I am inclined to agree that a database is simply a digital archive in which each stored component holds equal status and depends upon the skill and intent of the user to unleash the database’s potential. But, as Patrick Spedding cautioned about the limits of the ECCO database, recognizing the shortcomings of those digitized holdings is absolutely essential to making effective use of them. Understanding the shortcomings of OCR and recognizing error rates in digitization, as pointed out by Simon Tanner in an earlier reading and further underscored by Spedding, are critical factors in making effective use of any database.

In my own tinkering with the historical newspapers archived on ProQuest’s site, I learned quickly that each search for a key term produced not simply easy-to-digest results but actually a whole new database. In other words, in the simple act of saving my results to the “My Research” feature in ProQuest, I engaged in the digital manipulation of a database that Manovich discussed. In effect, I had created a new database to which I could potentially apply further searches with greater granularity. But I stumbled a bit here, since I could not figure out how to conduct these more refined searches from my newly created database. What I did discover, though, was that ProQuest’s search function queried the actual OCR-produced scans of the newspapers. And instead of identifying complete phrases, the search produced results according to each word in a phrase. Thus, my new database became filled with needless hits based upon individual words in the phrase, most notably the preposition “of,” instead of hits on the complete phrase “invasion of Europe.” Even so, the results quickly narrowed the database for me and presented some discernible patterns. Most importantly, they packaged into a newly defined database the actual primary sources to which I needed to apply the historian’s traditional qualitative analysis. I kept in mind Sean Takats’s concerns about the abundance of source material as I tinkered with ProQuest, but my ability to reconfigure and limit the initial database helped alleviate some of those concerns. I was able to follow my “hits” directly to the OCR-scanned facsimile of the articles and assess them individually. But I still found myself sorting through a lot of superfluous stuff. Thus, the act of sifting through numerous sources, some useful but many not, showed me that the traditional approach to researching history still applies — but the research part now seems much more efficient thanks to the digital database.
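To make concrete why the preposition “of” flooded my results, here is a small Python sketch contrasting word-by-word matching with exact-phrase matching. The sample snippets and dates are invented, and this illustrates only the behavior I observed, not ProQuest’s actual search logic:

```python
# Sketch of why word-by-word matching balloons a result set: any edition
# containing "of" counts as a hit, while an exact-phrase match does not.
# The editions below are invented examples.
editions = {
    "1944-01-04": "talk of the coming invasion of Europe grows louder",
    "1944-02-11": "shortage of sugar continues on the home front",
}

query = "invasion of Europe"

# Word-by-word matching: a hit if ANY query word appears
loose_hits = [date for date, text in editions.items()
              if any(word in text.split() for word in query.split())]

# Exact-phrase matching: a hit only if the whole phrase appears
phrase_hits = [date for date, text in editions.items() if query in text]

print(loose_hits)   # ['1944-01-04', '1944-02-11'] -- "of" matches both
print(phrase_hits)  # ['1944-01-04'] -- only the true phrase match
```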

The second point from the readings that grabbed my attention was the idea that databases can now allow historians to make subordinate points within a broader argument without engaging in extensive — and possibly digressive — research. W. Caleb McDaniel described how some scholars used search “hits” to quantify and support “points that were secondary to their arguments,” but the danger rests in what Lara Putnam cautioned against as “superficiality or topical narrowness.” I am inclined to agree, yet I find this use of the database particularly intriguing for my own dissertation research. For example, my focus will be on examining how the radio and print media portrayed D-Day as it was happening on 6 June 1944. But I wanted to explore, as a subordinate matter, the degree to which newspapers “talked up” the invasion in the six months leading up to the event. The idea of reviewing extensive six-month samplings of multiple American newspapers to support a smaller point contained in one or two paragraphs seemed quite daunting and not the best use of my time. I liked the terms Putnam used to describe the possibilities of making transnational connections to historical arguments through searches among multiple databases — “side-glancing” and “term-fishing.” These terms helped me conceptualize how databases can enable the inclusion of subordinate points within an argument without necessarily crossing external boundaries as Putnam intended. In effect, the point made is strictly contingent upon a targeted — but hopefully not superficial — acknowledgement of another factor that bears directly on the core argument, without the need for an in-depth examination of numerous primary sources. But “hope” isn’t a method; and, in spite of my attraction to the concept, I’m concerned that such results may in fact make narrow, tenuous points that won’t withstand scrutiny. My greater fear is that historians more broadly may come to rely on basic patterns gleaned from databases to make many of their key points. My own thinking here is incomplete, and I may have actually talked myself into the very pitfall that concerned Putnam — a tendency toward superficiality. Frankly, I won’t know how I feel about my own re-defined notions of side-glancing and term-fishing until I put them to the test.

In sum, I see databases as a great thing for historians, but I’m skeptical about characterizing them as a genre unto themselves. Manovich’s article makes an interesting case for the database as some new “cultural form,” but I can’t help but see databases (at least for the moment) as just another digital tool historians may leverage to help accelerate and enrich their research. In other words, a database is really just an archive without the dust.  And so I proceed with cautious enthusiasm!

Steve Rusiecki


Week Two Practicum Blog: Locating Digital History Sites on D-Day and the Media

A quick Google search of my dissertation topic on D-Day and the media nearly overwhelmed me with hits about the basic facts behind the Normandy invasion on 6 June 1944. Unfortunately, nothing I found suggested an ongoing scholarly conversation about my topic. However, I found numerous examples of sites hosting digitized primary sources — specifically American newspapers from across the country and actual radio broadcasts of the invasion — that I know will prove extraordinarily useful to my research.

I began by searching for anything related to “D-Day and the Media.” Surprisingly, the first result was for an organization, D-Day Media Group, which promotes music, film, art, and literature on behalf of the African diaspora. I could not discern why the group chose to name itself “D-Day,” but my best guess is that the name suggests a “global invasion” of African-produced and African-inspired art. But the many, many links that followed this first result highlighted numerous American and international news Web sites reporting on the recent 70th anniversary of D-Day. Clearly, the timing of my search had everything to do with elevating these hits to the first 20 or so on Google. The commemorations of D-Day by these sites took many forms. The BBC, for example, had actor Benedict Cumberbatch reading aloud the actual transcripts of BBC radio newsflashes, while Quinnipiac University’s Web site advertised how the school had “Tweeted” facsimile images of D-Day front pages. While my research is clearly focused on the news media’s reporting at the time of the invasion, these commemorative efforts for the 70th anniversary essentially represented, as Roy Rosenzweig and Daniel Cohen might have suggested, Web-based examples of public and community history at work. But these sites were extremely useful in supporting my broader contention (dare I say “hypothesis”) that D-Day remains today a seminal American memory of the war due largely to the media. Most notably, the use of period newspapers and old radio sound clips to reinforce America’s (and the world’s) existing memory and perspective of D-Day was quite striking to me.

Despite altering my search phrases to “D-Day and Newspapers” and “D-Day and Radio,” I still found no evidence of an ongoing scholarly conversation about how the media portrayed the invasion and that portrayal’s effect on American memory. A few hits highlighted the term “propaganda,” but only in its most extreme, negative sense (mainly on Wikipedia). Actually, news reportage during D-Day was much more nuanced in this regard.

What I found to be most useful, or potentially useful, were the online archives that hosted complete audio files of the radio coverage on D-Day and troves upon troves of newspapers reporting the invasion in one way or another. Two of the most intriguing sites were “paywalled” and thus, for the moment, out of reach: www.archives.com and www.newspaperarchive.com. Both sites emphasized newspapers as a genealogy source, but they also carefully organized their archives by state, city, and township, which will prove very useful as I attempt to discern a “regional” flavor (if one existed) to the D-Day reporting. For example, did the reporting on the West Coast, which was closest geographically to the war in the Pacific and whose populations even felt directly threatened by the Japanese, elevate the D-Day invasion to top billing? The answer is behind the paywalls. Frankly, I did not expect to encounter such extensive online newspaper archives outside of what I have already sampled from ProQuest. And, as expected, ProQuest’s sources did not “pop” on the Google search.

I targeted my next search to the place where I knew I would find an online trove of digitally archived newspapers and perhaps vintage radio broadcasts – the National Archives Web site.  I was surprised to see that, while www.newspaperarchive.com boasted newspapers dating back to 1607, the National Archives advertised only 1690 to the present, leaving 83 years of colonial and pre-colonial newspapers possibly unavailable online or listed as a holding.  Yet I was more surprised to find that while the Archives had digitized many newspapers (the OCR scans were of varying quality but the close-up feature was fantastic), many were simply listed as holdings accessible only on site via microfilm or in the document’s original form. The search engine was very good, though. The choices available to narrow one’s search — frequency of a topic, ethnicity of the target audience, etc. — seemed extremely useful and very relevant to my research.  I plan to spend much, much more time on this site in the future.

In the end, I found several digital-history sites — mostly archival in nature — that will help me immensely to locate the relevant and important sources I need to serve the cause of my argument.  At first blush, the online sources can seem overwhelming; but, after a bit of digging and classifying the sites, I found that I could screen for the most useful ones rather quickly.  This practicum was most productive!

Steve Rusiecki

Week Three Reading Blog: The Pitfalls of Digitally Converted Sources

If one message emanated loudly from this week’s readings on digitization, it was: “Be careful!”  Each author consistently offered a cautionary tale about the potential pitfalls of digitizing analog sources for use by historians and other scholars while simultaneously (but somewhat hesitantly) conveying enthusiasm for the vast opportunities offered by such digitization efforts. Even though Dan Cohen and Roy Rosenzweig proclaimed rather proudly that the “past was analog” and the “future is digital” (Cohen and Rosenzweig, 80), they readily admitted that the digitization path they advocated was truly ‘undiscovered country’ fraught with potential problems and unintended effects.  For me, the digitization of sources, particularly newspapers, represents a windfall in accessibility; but, as Ian Milligan cautioned, historians must know the strengths and limitations of the digitized sources they are using, must be transparent about these sources’ digital nature, and must understand how they were constructed from their analog source (Milligan, 566-567).

A great deal of skepticism surrounded one technique in particular: Optical Character Recognition (OCR). Issues of accuracy, cost, and digital portrayal seem to mark OCR as perhaps a useful tool for some very specific purposes (such as data mining) but not as a substitute for the source itself. I agree fully. My limited exposure to OCR-digitized newspapers in ProQuest’s Historical Newspaper Database suggests that although OCR-captured works may assist in locating broad trends with regard to selected words and topics, there is no substitute for the historian viewing the original newspaper as it physically existed for its contemporary readers. This aspect of how people experienced the news is important to me because my dissertation concerns how radio and newspaper media constructed a collective American memory of D-Day (6 June 1944) at the time the event was occurring. What excites me about OCR-digitized newspapers is not only the availability I mentioned earlier but also the ability to conduct key-word searches across larger corpuses of selected American newspapers to gauge the degree to which those papers discussed or referenced the coming invasion in the six months leading up to 6 June 1944. In fact, my brief tinkering with data mining in ProQuest’s newspaper archive revealed what may be a new feature (at least since some of the readings were published) that steered me toward an actual facsimile scan of the newspaper articles based on “hits” from my search. Unfortunately, the articles were isolated from the paper’s original layout, so I could not experience the text in the same context as the contemporary reading audience.

In the main, I see great possibilities for digitization — as long as digitized sources supplement other ways that historians locate and interact with source material. As someone who has spent long hours in the National Archives pulling box after musty box of World War II documents from the shelves, I have always found great value in seeing and holding original documents with all their inherent imperfections and unique graphics. As Simon Tanner pointed out through numerous examples, OCR does not do a good job of capturing text from complex layouts that use graphics and other images. Newspapers fall into this category, and seeing original papers, with their often varying photographic quality (and even their use of unbleached pulp paper), can tell us a lot about how people might have remembered what the paper was reporting and how its images shaped their memory. I know that Sarah Werner facetiously characterized this approach as “nostalgic fetishizing,” but I think this type of historical research still has a place in the world of digital sources, especially since scholars like Marlene Manoff are starting to see electronic objects as material objects themselves (Manoff, 312). The danger here is that some electronic permutation of a document or photograph — enhanced or refashioned — becomes an actual material substitute for the original analog version. That concept gives me pause — and some reason for concern. Will historians find themselves relying almost exclusively on digitally enhanced sources that did not exist in the same form as when they were created? Isn’t that notion the very idea behind the term “ahistorical”?

In any case, and despite my long-standing penchant for using original analog documents as sources, I am most excited about the possibilities of accessing a broader range of material online through digitization. The seven or eight major newspapers now fully digitized in ProQuest’s database are a great source for testing some quantitative theories about how heavily D-Day weighed on the public mind in the months leading up to the invasion and how the media might have portrayed the invasion as an “America-only show.” However, I am painfully conscious of the fact that the corpuses of many other newspapers are limited and, in some cases, skewed toward certain populations. For example, I am still on the hunt for good representative samplings of newspapers that targeted specific audiences, such as African-Americans. These newspapers’ perspectives on D-Day are highly relevant to my research. Personally, I have no problem digging for those tough-to-find sources in hard-copy archives or wherever they may be, but I share Ian Milligan’s fear that lazy historians won’t do the legwork needed to dig up obscure sources and will instead rely solely on what is available digitally. I found Milligan’s contention that only digitized Canadian newspapers were figuring prominently as sources in recent dissertations to be a bit shocking. His point dredged up for me fears that digitized sources — and only digitized sources — might represent for many up-and-coming historians the left and right boundaries of their research efforts. Milligan’s concern contradicts Sean Takats’s fear that an abundance of digital sources is the real danger. Frankly, I think what gets digitized, and when, will be the biggest problem; abundance probably won’t be the real issue. In my own work in years past, I reveled in finding some previously undiscovered source — that “gem” — that added a unique insight or, dare I say it, “flavor” to the history I was writing. I would hate to see future historians discouraged from doing the necessary legwork to find those obscure sources that build upon existing historical arguments in rich and informative ways.

Ultimately, digital sources will be (and are now) a boon to all historians — professional and amateur alike. The challenge will not be identifying what Bob Nicholson called “the digital turn” (we’re already there); it will be for historians to learn the strengths and weaknesses of the digital sources they use and to employ them accordingly in the service of their historical arguments. But a digitized source should be only one type of source, not the only source upon which we as historians rely. We still need to dig in those musty old archives to find the undiscovered “gems” awaiting the light of day.

Steve Rusiecki