Monthly Archives: November 2014

Project: Text Mining D-Day

1.  A Media Invasion

How did the media — radio, newspapers, and magazines — construct an American collective memory of D-Day, 6 June 1944, that transformed that event into the defining act of America’s participation in World War II?  This question cuts to the core of what I want to explore in my dissertation. The media, always claiming to be “history’s first draft,” seemingly went to great lengths in 1944 to convert that “first-draft” trope into a “definitive” depiction of D-Day and what it meant to all Americans. According to John McDonough, D-Day was “Broadcasting’s First Invasion,” “preceded by a huge media buildup that gave broadcasters time to organize their sources and promote their coverage” (McDonough, 193-194). In McDonough’s mind, the media’s reporting approach enabled a powerful sense of the invasion as something mammoth and unprecedented in history. But how did radio, magazine, and newspaper correspondents and their respective news agencies construct  that memory of D-Day so that it became America’s watershed event of the war, an event that would later serve as the locus of all American World War II commemorations for 70 years? What language, images, and tropes did they employ?

In reality, the size and scope of the invasion, although extraordinarily impressive, paled in comparison to the efforts that America’s and Britain’s Russian allies were putting forth against the Germans on the Eastern Front.  Richard Overy rightly argued that the Red Army, not American or British forces, had broken the back of German land power and, ultimately, the Wehrmacht’s ability to resist  (Overy, 1). So why did America’s D-Day achievements come to supplant in the American mind the Soviet Union’s achievements in the East? And, for that matter, Britain’s and Canada’s participation in D-Day? Was it nationalism run amok?  I think the answer, at least for the American home front, rests with the American media of 1944, specifically newspapers published throughout the country.

Text mining offers me a unique opportunity to test selected terms throughout numerous newspapers as a way to understand how the newspaper media contributed to America’s collective memory of that event and its importance to the war.  Newspapers represented the best way to communicate directly to the American people a common message  across a broad geographical landscape. In fact, Benedict Anderson contended that newspapers connected communities  spatially through the simple ceremony of common readership (Anderson, 60-65).  Everyone was certain to be reading the morning, noon, and evening editions at the same time during his or her day,  despite the different time zones. The employment of specific language — either topics or terms — communicated through this medium had tremendous power over America’s view of the world.  The use of language can be powerful, particularly when employed as propaganda to promote a particular perspective. Yet even though the Office of War Information (OWI) censored certain images and statements, the American media was generally on board with supporting the war effort and, for the most part, stuck to the facts — good, bad, and indifferent.  Yet the simple repetition of “true facts” has the power to propagandize, if you will, any message to the exclusion of other perspectives. Thus, I am interested to see how certain selected terms — words selected for what they may connote — appeared repeatedly in newspapers in the six months leading up to D-Day and on D-Day itself.

2. Text-Mining Methodology

My methodology involves text-mining a selected corpus of newspapers from January 1944 to June 1944 to examine a broad range of news articles about the war for the repeated use of certain terms. These terms might shed some light on how the newspaper media consistently, or perhaps intermittently, used key words to bombard the reading audience with the specific themes embedded in these words. Unfortunately, the limitations of online newspaper archives forced me to use a corpus of newspapers from a single state: New York. And, due to copyright laws still in effect for newspapers published after 1923, I had to settle for whatever was most readily available. Thus, the Fulton History Web site became my primary online archive. This site, ostensibly the personal project of one man, uses optical character recognition (OCR) to create a searchable archive of scores of New York newspapers. The site allows users to download and extract from the OCR pages text versions of each newspaper, but with varying quality and accuracy. The average OCR accuracy rate was around 50 percent, which limited the number of newspapers I could select for my corpus, because I found myself investing the bulk of my time in correcting the OCR. Likewise, many of the newspaper scans appeared to come from murky microfilm images, rendering many newspapers completely unusable. Given time constraints, I settled on three different newspapers with different dates to represent each month from 1 January 1944 until 6 June 1944 for a total of 18 discrete newspapers. In only one case, I used the same newspaper title twice but for different months. My final corpus represented newspapers from 13 separate counties stretching from upstate New York down to Long Island.
Although many of the newspapers were  focused on specific regions within New York state, a brief survey of the front pages showed that many of these papers used Associated Press reports for their war news, thus ensuring that most members of each county were receiving some of the same news as others in the state but not always in the same format or level of detail.

In an effort to maintain a manageable number of terms to mine from the corpus, I selected certain nouns to embody the frequency of coverage of selected theaters of war (the European Theater, the Pacific Theater, the Eastern Front, and so on) and how often the word “invasion” itself was mentioned in the months leading up to 6 June 1944.  For example, I selected the names “Eisenhower” and “Montgomery,” both key leaders in the invasion, as embodying a variety of things: confidence in Allied leadership, status of invasion planning, and Anglo-American cooperation (Eisenhower had been named Supreme Allied Commander for the invasion in early January 1944) (Weinberg, 660).  In order to gauge the extent of invasion coverage leading up to 6 June 1944, I selected the nouns “invasion” and “France” to connote actual preparation for the cross-channel assault, status of invasion planning, potential timing for the invasion, a re-imagining of America’s spatial understanding of the war’s boundaries, and the introduction of what would soon be a new theater of the war. In all, I selected seven words to test for the January through May corpus and four words for the 6 June 1944 corpus.

The Voyant text-mining software was the primary vehicle for this project. Voyant allowed me to visualize the results in various ways — through word clouds, distribution graphs, or hierarchical lists based on frequency of use. My first effort before testing individual words would be to upload the entire corpus to Voyant to see what it revealed in the aggregate. Second, I planned to use the small corpus of 6 June 1944 newspapers to test selected words. Third, I planned to upload the January 1944 through May 1944 corpus to test another set of specific terms. Finally, I intended to use another tool, the New York Times (NYT) Chronicle text-mining software, to compare the same terms in that single newspaper for the same months as my larger corpus (January through May 1944) to see if the results were similar or differed in other ways.

3. Testing the Entire Corpus

My first text-mining effort involved the entire corpus of 18 newspapers in an effort to see what word frequencies proved most dominant without mining for specific words from my lists. I found the Cirrus word cloud to be the most revealing, especially after activating the Stop Word List for English to eliminate articles and prepositions. The word cloud appears below.

The word appearing most frequently in the corpus was “Mrs.” This title for a married woman, appearing 633 times out of a total of 127,866 words, suggested the significance of the activities that married women, whose husbands were presumably fighting overseas, had taken on during the war. The next most frequent word was “war” at 410, which is not particularly surprising given world circumstances at the time. The term “German” came in at 175 compared to “Japanese” at 104, which suggests a greater news preoccupation with war efforts against Hitler’s Germany in the months leading up to the invasion rather than the more active Pacific Theater. In one sense, this greater frequency suggests that the media may have begun to lay the psychological groundwork for the invasion by acquainting all Americans with the activities of, and threats posed by, the German war machine. This frequency of use is also in keeping with the Allies’ agreed-upon Germany-first strategy for the war. The frequencies of “army” (177) and “American” (168) do not appear to be particularly significant, especially in the absence of greater context. However, if attributable to the U.S. Army, then the frequency of use would be consistent with the fact that the Army was America’s largest military service (or department) at the time, encompassing both the ground and air forces. Finally, the term “invasion,” one of the terms I selected for specific text-mining, appeared a scant 141 times (“pre-invasion” appeared four times) out of 127,866 words, roughly one-tenth of one percent. This result suggests that news features focusing specifically on the nature, scope, or scale of the invasion prior to D-Day were not as prevalent as one might expect or as the frequency of the term “German” suggested earlier. This result was particularly surprising to me, since I expected a greater frequency of use.
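The kind of count behind a word cloud can be approximated in a few lines of Python. This is only a minimal sketch of frequency counting with a stop-word filter, not Voyant's own implementation; the `tokenize` and `top_terms` helpers, the short stop-word list, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real English lists are far longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "at", "by", "with"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def top_terms(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent tokens after dropping stop words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens).most_common(n)

# Hypothetical sample text, not drawn from the corpus.
sample = "The invasion of France and the war in the Pacific: war news for the home front."
print(top_terms(sample, n=3))
```

Because the tokenizer strips trailing punctuation, a title such as “Mrs.” would be counted as the token `mrs`, which is worth keeping in mind when comparing such counts against any tool's display.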

The Words in the Entire Corpus feature (embedded below) allowed me to dig deeper for other words that I expected to see mentioned more frequently but that appeared lower on the frequency hierarchy.

In particular, the “British” (114) and “French” (77), both major Allies in the cross-channel effort (although the French would not contribute significantly until well after the Americans and British gained a foothold in France), received limited coverage. I expected the term “British” to appear more frequently given that nation’s significant (but somewhat smaller) contribution to the Anglo-American effort to invade France from across the English Channel. Most remarkably, “Russian(s)” appeared a mere 62 times, less than one-tenth of one percent of all words in the corpus. Given that the Red Army was taking on the majority of all German land forces in the East (about 175 German divisions compared to the 50 or so that would eventually fight in the West after D-Day), I found this result to be shocking. Clearly, the American media either did not recognize the scale, nature, or effect of the Soviet contribution or chose to ignore it in favor of emphasizing the more modest, but no less important, achievements of the Western Allies. This result smacks of American nationalism.

Before ending my analysis of the text-mining results from the entire corpus, I selected the Word Trends feature to test one word, “invasion,” across the entire corpus (see embedded distribution graph below). The results were intriguing as evidenced by the near absence of invasion-related features in the newspapers starting in January 1944 and into early May 1944. The only spikes in usage appear for those newspapers in the corpus dating from mid-May and later. This evidence suggests a less vigorous effort in preparing all Americans for the coming invasion than John McDonough has suggested.

4. Mining the D-Day Newspapers

The second part of my Voyant project involved only a three-newspaper corpus, all from the day of the invasion, 6 June 1944. I was interested in exploring the prevalence of four specific themes that my recent research has revealed only in the editions published on D-Day. The newspapers in this small corpus, the Nassau Daily Review Star (Nassau County), the St. Lawrence Plaindealer (St. Lawrence County), and the Syracuse Herald Journal (Onondaga County), all represented an interesting mix of both national and regional news, which further intrigued me. How dominant might these themes be in regionally focused newspapers? I chose four nouns to represent the themes I wanted to explore: (1) “Crusade,” the word General Eisenhower used to characterize the invasion in his message to the troops on D-Day, would connote the moral, spiritual, and religious nature invested in the invasion; (2) “Church” would further suggest America’s acceptance of the moral nature of the invasion and the possible fear that led Americans to fill church pews once news of the invasion emerged; (3) “Eisenhower” would suggest confidence in the Allied operation’s leadership and that the media was firmly behind the man heading the invasion; and, finally, (4) “Normandy” would re-orient the American public’s spatial sense of the war from just “France” in general to a more specific location in France, all dependent, though, on when the actual landing locations became public.

When I uploaded the small corpus, I checked the Cirrus Word Cloud to see which of the four terms I selected would appear.  Surprisingly, only one appeared, Eisenhower, with a corpus frequency of 24 (see Word Cloud below). This number was much lower than I expected, even for such a small corpus. Even British Prime Minister Winston Churchill’s last name appeared 30 times, six more than Eisenhower. If examined in isolation,  these two results might suggest to the American people that the real driving force behind the invasion was Churchill, which was not the case; he argued for years against the cross-channel assault until he could no longer stave off American demands to conduct the operation.  Less surprisingly, and based on earlier results when mining the entire corpus, the most frequently appearing term is once again “Mrs.,” suggesting the dominant role played on the home front by women as their husbands, sons, and brothers fought the war overseas.

When I mined specifically for the next term, “Crusade,” I was surprised to find only one instance of the word in use — and only in the Nassau Daily Review Star.  The Word Trends distribution graph provides a stark visual impact of this low frequency rate (see below), which further suggests that news of Eisenhower’s “Great Crusade” message had not yet made it to all editions on 6 June 1944 or that local editors elected not to include it in their specific editions.

When I mined for the companion term to “Crusade,” “Church,” I found a frequency in the Words in the Entire Corpus feature of 34 for “church” and 10 for “churches.”  These results were more consistent with the trends I have already seen in my research with newspapers at the national level.  And, to confirm the findings for “Crusade,” as evidenced visually by the Word Trends distribution graph below, “church” or “churches” appeared solely in the Nassau Daily Review Star at a frequency of 12.04 per 10,000 words. Was the reading public served by the Star more religious than the other two newspapers’ audiences?  These results from such a small corpus suggest that I will need to consider the religious inclinations of the target audience for regionally produced newspapers to ensure my analysis occurs in the proper context.

My fourth and final term for the 6 June 1944 corpus was “Normandy,” which would become the dominant geographical battle-space in the minds of all Americans for the next two months.  The Words in the Entire Corpus feature (see below) picked up three variations of the word in use for a total of 16 appearances.

The Word Trends distribution graph (see below) also helped to confirm the wider usage of the term “Normandy” beginning on the very day of the invasion. Each newspaper used the term, which suggested that America’s spatial sense of the global war would include a new piece of Nazi-occupied terrain upon which Americans would engage the German war machine directly and, at least for one population served by the Nassau Daily Review Star, fulfill a moral (and perhaps religious) mandate to free the French people and all of Europe from Hitler’s evil regime.

Strangely enough, the Nassau and St. Lawrence newspapers only mention the term briefly, 3.44 and 2.52 per 10,000 words respectively. By contrast, the Syracuse newspaper,  presumably with a much larger readership, mentioned “Normandy” 13.30 times per 10,000 words, a significant difference in coverage. Content size of the news coverage does not seem to have had anything to do with the word’s frequency, particularly since the Nassau paper has much more overall content than the Syracuse newspaper.
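The per-10,000-word figures quoted above are relative frequencies: a term's raw count divided by the length of the newspaper's text, scaled to 10,000 words. A minimal sketch of that normalization follows; the `per_10k` helper and the counts for "Paper A" and "Paper B" are hypothetical and are not figures from the corpus.

```python
def per_10k(term_count, total_words):
    """Relative frequency: occurrences of a term per 10,000 words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return term_count / total_words * 10_000

# Hypothetical counts (term hits, total words) chosen only to illustrate the effect of length.
papers = {"Paper A": (7, 20_000), "Paper B": (12, 9_000)}

for name, (hits, words) in papers.items():
    print(f"{name}: {per_10k(hits, words):.2f} per 10,000 words")
```

Because the measure already divides by document length, a shorter paper with a handful of additional mentions can show a much higher relative frequency than a longer one, which is consistent with the contrast between the Syracuse and Nassau results.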

Overall, the text-mining effort in which I focused on four selected words in a three-newspaper corpus for one day, 6 June 1944, proved to be very revealing. Some results surprised me, such as the limited use of “Crusade” and “church” among all three newspapers. This result suggests that, at least at the outset, Americans did not necessarily cast the invasion in religiously charged moral terms, although characterizations of Hitler’s regime as “evil”  remained prevalent throughout the media.  I was particularly pleased, though, to see that my OCR-corrected text worked well enough in the Corpus Reader (see below) to allow me to scroll for the specific context of each text-mined word. My initial fear was that converting the scanned newspapers into text files had resulted in too much “format collapsing,” which had caused articles inside their respective columns to become entangled. In many cases, I was unable to untangle the text and instead worked to ensure that the words were properly spelled out so that Voyant would recognize them.  Thankfully, a lot of the “tangling” did not detract significantly from my ability to see the words in the context in which the newspapers were using them.

5. Mining the Pre-D-Day Newspapers

The third and final text-mining effort with Voyant involved the entire corpus minus the three newspapers from 6 June 1944 (see Corpus table below). I removed the three 6  June 1944 newspapers because this discovery exercise would focus on the appearance (or absence) of seven meaning-laden words that I wanted to test only in the context of the news leading up to the actual invasion.

The words I selected to represent specific themes or topics are as follows: (1) “Russia” and “Soviet” to indicate the level of coverage the Red Army was receiving in the U.S. news media leading up to D-Day; (2) “Eisenhower” and “Montgomery” to suggest the level of effort the media invested in “talking up” the two key invasion leaders and, perhaps, how these men were the right ones for the job (in other words, promoting confidence in the war’s leadership); (3) “Invasion” to help gauge the degree to which the media was laying the mental groundwork for the upcoming operation; (4) “France” to determine how the media helped its readers to re-imagine and redefine the forthcoming shift in spatial boundaries for America’s participation in the war; and (5) “Pacific” to determine the degree to which this theater of war may or may not have supplanted pre-invasion news leading up to D-Day. Naturally, I wanted to include many more words, but doing so had the potential to make the scope of the project unwieldy.

The key words dominating the corpus once again proved intriguing when the Word Cloud emerged (see below). Once again, “Mrs.” dominated the frequency hierarchy at 500 followed by “Mr.” and “war,” both of which tied for second place at 285. These results offered no additional insight into what I discussed in earlier ‘reveals.’ One word that I did not consider for my text-mining list, “victory,” appeared 58 times. According to the Corpus Reader, this word often appeared in the context of war news or editorials and, in some cases, advertisements. But out of 94,168 words in the corpus, the fact that this word accounted for less than one-tenth of one percent of the words in the 15 newspapers may suggest that optimism was not high and that the looming invasion and its imminent costs might have weighed heavily on the public’s collective psyche.

I began text-mining with the first two words in my list, “Russia” and “Soviet.” Given that most World War II historians in the academy today agree that Russia’s ground efforts in the East were central to defeating the Wehrmacht, I was curious to see how much attention that theater received from the American media. I chose these two words to represent that theater because my research has shown me that the U.S. media tended to refer to the Soviet Union most often with these terms.  The results of my search for “Russia” showed that, of the 62 combined hits for all variations of the word, “Russian” was the most dominant variation at 28.   Therefore, I used this version in the Word Trends feature to visualize the distribution of this rather small numerical result.  The results yielded a strikingly uneven distribution of the word’s use among the newspapers. In fact, the Schenectady Gazette of 1 January 1944 used “Russian” the most at 19.92 per 10,000 words.  Nine of the 15 newspapers did not mention “Russian” at all, a surprising result given that the Red Army had broken the 900-day siege of Leningrad in January 1944 and made other large gains throughout the spring of 1944 on the Eastern Front. These specific results suggest to me that the American news media, in the months leading up to D-Day, provided very little coverage of the most decisive front of the war. But why?

When I used my initial version of the word, “Russia,” to create another distribution graph, the results were radically different (see below). In this case, only seven of the 15 newspapers failed to mention Russia; and, strangely enough, the Mexico Independent of 13 January 1944 (covering the new Soviet winter offensive as evidenced by consulting the Corpus Reader) used the term slightly more often than the Schenectady Gazette (4.66 versus 4.60 per 10,000 words, respectively). Overall, these results reinforced my earlier assessment but also revealed that the most dominant coverage of the Eastern Front leading up to D-Day dealt almost exclusively with Red Army offensive actions in January 1944 and not during subsequent months, perhaps because news was slow during what proved to be a period of relative calm on the other Allied fronts.

Lastly, when I entered “Soviet,” I received 20 “hits” for the singular version of the noun and two hits for the plural version. Once again, the Corpus Trends distribution graph identified nine of 15 newspapers that did not mention the word at all. The most frequent use once again appeared in the Schenectady Gazette for 1 January 1944. However, a brief mention of “Soviet” appeared in the Brooklyn Eagle of 1 April 1944 in the context of possible Soviet peace negotiations with some Scandinavian countries (5.28 per 10,000 words).
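Tallies of word variations, such as the singular and plural forms reported here, can be approximated outside Voyant with a simple prefix match. The sketch below is an illustration only and is not how Voyant itself groups word forms; the `variant_counts` helper and the sample sentence are assumptions.

```python
import re
from collections import Counter

def variant_counts(text, stem):
    """Count every word form that begins with `stem`, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)
    return Counter(match.group(0).lower() for match in pattern.finditer(text))

# Hypothetical sample text, not drawn from the corpus.
sample = ("Russian troops advanced. Russia reported gains; "
          "the Russians pressed on. Russian guns fired.")

counts = variant_counts(sample, "russia")
print(counts)                 # per-variant tallies
print(sum(counts.values()))   # combined hits across all variants
```

A prefix match of this kind will also sweep in unrelated words that happen to share the stem, so the resulting list of variants should be checked by eye before the counts are combined.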

The most intriguing aspect of these collective results is that these newspapers tended to refer to the Soviet Union as “Russia” (or some variation thereof) nearly three times more often than “Soviet.” Perhaps the news media was reluctant to identify the Russians through their nation’s political identity, Soviet, more often than necessary. “Soviet” likely reminded many savvy journalists that America was allied with a nation that had a track record only slightly better than Nazi Germany’s. The Soviet political system, and that nation’s brand of communism more broadly, was the antithesis of American democracy, and this knowledge possibly tamped down journalistic enthusiasm for reporting Stalin’s achievements on the Eastern Front. Granted, the corpus under examination is small and regionally focused, but I expected a little more coverage of the Russians given their contribution to the Allied effort. But was this omission due to American nationalism? Or was it a deliberate effort to minimize the Soviet contribution because of that nation’s political system? America was certainly pushing lend-lease vehicles and equipment to the Russians, so we had a direct stake in their success. So the question remains: Why were the Russians so poorly represented in the media in the few months leading up to D-Day? Would more dominant pre-invasion coverage provide the right clues?

The next text-mining terms shifted focus onto the Allied leadership, both American and British.  I first mined for “Eisenhower” to determine how often the news media mentioned him in the months leading up to the invasion.  The results stunned me. “Eisenhower” (and the name’s plural variation) appeared only 12 times in the entire corpus! I expected a much greater effort by the press to promote the new Supreme Allied Commander as the right man for the job. Eleven of the 15 papers never mentioned him in the five months prior to D-Day (see Corpus Trends distribution graph below).  Only the Mexico Independent of 13 January 1944 mentioned Eisenhower with any frequency and strictly in the context of his appointment as commander of Operation Overlord. Aside from that news feature, the Supreme Commander received scant attention. Amazing.

Given the dearth of coverage received by Eisenhower, the American commander, I was even less optimistic about an abundance of coverage for Montgomery, the British general named to serve as the ground commander. And I was right: Montgomery’s name appeared only twice, in the Mexico Independent of 13 January 1944 (in the context of Eisenhower’s appointment as Overlord’s commander) and in the Brooklyn Eagle of 1 April 1944 (in the very brief context of invasion preparations) (see graph below). Thus, Eisenhower the American appeared six times as often as his British counterpart (12 mentions to two), once again suggesting a U.S.-only focus in the newspapers but, more broadly, even less interest in the men who would lead the invasion. Stunning.

The next text-mining term, “invasion,” promised more specific results about the newspaper coverage of the invasion itself prior to 6 June 1944. I found those results to show greater coverage than that of the two invasion commanders, Eisenhower and Montgomery; but, given the looming nature of the cross-channel assault and its purported significance to Germany’s direct defeat, the results were remarkably low. The word “invasion” appeared only 44 times and an additional five times in other variations (anti-invasion and pre-invasion). When compared mathematically to the 94,168 words in the corpus, the scale of pre-invasion coverage is negligible; the 49 combined occurrences represent a mere 0.052 percent of all words in the corpus (a fraction of .00052035). Six newspapers never mention the term in the editions I selected for the corpus. Less surprising is the fact that coverage of “invasion” begins to appear more often in April and May, principally because Europe’s fairer weather signaled to America that the invasion was imminent. In fact, the Niagara Falls Gazette used the word 12 times one month before the invasion in its 3 May 1944 edition as evidenced by the Word Trends distribution graph, which I set this time to reveal the raw frequency of words. Except for this example, I have been using the setting for “relative frequencies” in these graphs in order to gauge word frequency in a broader context (see graph below).
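As a check on the arithmetic, the share of the corpus taken up by “invasion” can be worked out directly from the counts reported above. This sketch assumes the share is computed from all 49 occurrences (44 plus the five variants) against 94,168 words; the `share_of_corpus` helper is hypothetical.

```python
def share_of_corpus(term_count, total_words):
    """Return a term's share of the corpus as a plain fraction (not a percentage)."""
    return term_count / total_words

hits = 44 + 5      # "invasion" plus the five variant forms (assumed to be combined)
total = 94_168     # words in the January through May corpus

fraction = share_of_corpus(hits, total)
print(f"fraction:         {fraction:.8f}")
print(f"percent:          {fraction * 100:.3f}")
print(f"per 10,000 words: {fraction * 10_000:.2f}")
```

The fraction comes to about .00052035, which is roughly 0.052 percent, or a little over five occurrences per 10,000 words. Expressing the same value per 10,000 words also makes it directly comparable with the relative frequencies shown in the Word Trends graphs.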

Next, I mined the corpus for the word “France” to test the degree to which the newspapers attempted (or did not attempt) to re-focus the American public’s thinking in geographic and spatial terms on what would soon become the new theater of war: the Western Front. Seven newspapers mentioned “France” a combined 18 times; and, according to the Corpus Reader, these instances were generally in the context of pre-invasion preparations and pre-invasion aerial bombardment. Once again, the Schenectady Gazette of 1 January 1944 used the word most often, largely due to the robust war coverage in that New Year’s Day edition (see graph below). Overall, the newspapers did little to prepare the American people psychologically for the shift in the war’s geographic locus.

The final term I mined in the corpus was “Pacific” to determine if greater coverage  of that more active theater in the months leading up to D-Day accounted for the dearth in coverage about the coming invasion. In fact, the Pacific Theater was quite active in the first half of 1944, and American troops were dying there daily.  The results in this case made sense.  The word “Pacific” only appeared 36 times; but, as the Word Trends distribution graph  illustrates, coverage of that theater was more universal through each newspaper in the corpus (see graph below). In this case, the attention paid by the press to an active theater of war instead of one that was gearing up to become active represents a possible explanation for why the invasion received such limited coverage in the months leading up to D-Day.

Overall, my text-mining efforts with the January through May 1944 corpus provided some surprising, and in some cases unexpected, results. Words and themes that I expected to see covered in greater relative frequency received little attention, even when considering the small, regionally focused nature of the corpus. In the main, the active theater at the time, the Pacific, did in fact receive more coverage as expected. The invasion, however, did not begin to emerge as a consistent word theme until April and May, and then only in a very limited way.

6. Comparing the Voyant Results with NYT Chronicle Results

Context is everything, and I wanted to see if another newspaper with a more national news focus, The New York Times, might support or refute the results of my selected corpus. I chose five words — crusade, church, Eisenhower, Normandy, and invasion — from the 11 words I used in my Voyant text-mining endeavor to test in the NYT Chronicle text-mining tool, which creates a distribution graph for a selected word based on the total number of articles against the year of publication. I began with “Crusade” and examined the use of that word only in the 1944 editions. The screen-capture below is a representative example of  how I used the NYT Chronicle distribution graph for all the words.

NYT Chronicle (Crusade)

The percentage version of the graph showed that .11 percent of all articles in 1944 contained the word “Crusade” for a total number, according to the raw data version of the graph, of 148 articles. By clicking on a certain point on the graph, I was able to go directly to a listing of all the articles starting on 31 December 1944 and working backwards, an awkward way of presenting the articles. As I expected, the word came into use most often in the 6 and 7 June 1944 editions of the Times but in numerous separate articles (see screen capture below). In this case, the one-time appearance of “Crusade” in the Voyant corpus of three newspapers for 6 June 1944 clearly underrepresented the word’s use on a more national media platform.

NYT Chronicle (Crusade Articles)

Next, I added the word “church” to the graph and received a .15 frequency-of-use percentage in 1944 — but out of an amazing 6,388 articles. Clearly, the word appeared many more times in 1944 than “Crusade,” but these appearances, according to a brief survey of the article listings, were mainly in the context of church announcements, births, weddings, and other events. The confluence of both words in the 6 and 7 June 1944 editions of the Times did not suggest that “church” saw any more use in the paper as a result of Eisenhower’s characterization of D-Day as “The Great Crusade.” Instead, I am more inclined to view these results in the context of the broader U.S. population’s religious and spiritual nature and the consistent importance of church and religion over time — and especially during a time of uncertainty due to the war. In effect, the term “church” did not spike in usage due to Eisenhower’s characterization of the invasion as a type of religious or moral crusade. Most Americans may have already felt that way about the operation.  These results support the Voyant text-mining results for “church” in the 6 June 1944 three-newspaper corpus. “Crusade” appeared once but did not seem to influence the appearance of “church,” which appeared 44 times in that corpus in one variation or another.

The next term I entered into the NYT Chronicle graph, “Eisenhower,” provided results that were markedly different from the Voyant results. When I mined “Eisenhower” in the three-newspaper corpus for 6 June 1944, his name appeared only 24 times, a small number given his status as the man around whom the invasion centered at that very moment. The larger corpus — January 1944 to May 1944 — provided even more meager results in the lead-up to the invasion, a mere 12 instances of “Eisenhower” in use during those five months. The NYT Chronicle corpus, however, offered results that showed “Eisenhower” in use much more often in the months leading up to the invasion; however, the results for the 6 June 1944 editions were similar.  In the Times’ 6 June 1944 editions, four articles featured “Eisenhower” at least once, which is consistent with the usage seen in the three D-Day editions from Nassau, St. Lawrence, and Syracuse.  However, of the 1,017 New York Times articles that mention “Eisenhower” in 1944 (.76 percent of all 1944 articles), approximately 306 articles mention his name in the months leading up to the invasion starting on 1 January 1944.  This result differs radically from the Voyant corpus results of 12 and suggests that nationally focused news paid greater attention to General Eisenhower in the pre-invasion months, most likely as a way to build him up in the minds of all Americans as the right man for the job. Oddly enough, the greatest use of Eisenhower’s name in the Times came in 1956, when the paper mentioned his name in 5.23 percent of all articles that year. Naturally, Eisenhower had just been re-elected president for a second term, but I found the press’s preoccupation with Eisenhower the president over Eisenhower the war leader to be quite intriguing (see screen capture below).

NYT Chronicle (Eisenhower)

I used the next term, “Normandy,” more as a type of control word, since I expected its frequency of appearance in The New York Times to align closely with the three-newspaper corpus from 6 June 1944 that I used in Voyant. The results identified a total of 1,542 articles that mentioned “Normandy” in 1944. Unfortunately, I encountered my first and only glitch with NYT Chronicle, which did not allow me to view article lists from 1 January 1944 to 9 July 1944, despite repeated attempts to do so on separate days. Any attempts to gain access to that part of the list always timed out after long periods of spooling. Therefore, I could not identify the specific frequency of “Normandy” in use for any of the Times’ 6 June 1944 editions for comparison purposes. However, I noted that 1,010 of the 1,542 articles that mentioned “Normandy” in 1944 appeared after 9 July 1944, which suggests that “Normandy” as a newly conceptualized expansion of the war’s geographic space for the American public began appearing on or about the time that the 6 June 1944 editions appeared. But, unfortunately, this result is not optimal and suggests some potential limitations in the use of some digital tools.

The fifth and final word I tested in NYT Chronicle, “invasion,” compelled me to convert the term into a phrase in order to refine my results. The term “invasion” by itself as a way to gauge the media’s emphasis, or lack of emphasis, on the invasion leading up to D-Day proved to be too imprecise for this digital tool. The search found the word in 4,075 articles in 1944 (3.05 percent of all articles); but, when I looked for the word’s use in context by scrolling through the article listings, I noticed that many correspondents used the word to characterize island assaults in the Pacific. I did not encounter this same problem when viewing the articles from my January to May 1944 corpus in the Voyant Corpus Reader. Therefore, I refined the search in NYT Chronicle by expanding the word into a phrase, which the tool allows. When I entered “invasion of Europe” into the graph, I obtained much better results. A total of 248 articles (0.19 percent) that mentioned “invasion of Europe” appeared, 74 of which surfaced in the Times from 1 January 1944 until 5 June 1944. Those 74 articles represented roughly 30 percent of the 248 articles. When compared to the 44 instances of “invasion” in use in the entire January to May 1944 Voyant corpus, the national coverage addressing the upcoming cross-channel assault proved to be much more dominant in the Times, a newspaper with a greater national focus, than in the more regionally focused New York newspapers.
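As a quick arithmetic check on the article counts quoted in this section, the subset-to-total shares can be worked out in a few lines of Python. This is only a minimal sketch: the helper name `share` and the dictionary labels are hypothetical, chosen for illustration, and the numbers are simply the counts reported above from the NYT Chronicle article listings.

```python
def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# (articles in the period of interest, articles in all of 1944),
# copied from the counts reported in the text above.
counts = {
    "invasion of Europe, 1 Jan to 5 Jun 1944": (74, 248),
    "Eisenhower, pre-invasion months": (306, 1017),
    "Normandy, after 9 Jul 1944": (1010, 1542),
}

# Illustrative labels only; each value is a percentage of that term's 1944 total.
shares = {label: share(part, whole) for label, (part, whole) in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct} percent of that term's 1944 articles")
```

Laying the counts out this way makes it easier to compare how much of each term's yearly coverage fell before or after the invasion, without relying on the graph's percentage view.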

7. Conclusion: What Did I Discover with Text Mining?

My experience text-mining with Voyant and NYT Chronicle proved to be both significant and extraordinarily revealing.  At first, I thought the regionally focused New York state newspaper corpus I used in Voyant would limit my ability to gain insight into certain questions that were nagging me. My intention had been to mine a more nationally representative body of newspapers. However, the regionally bound corpus of newspapers proved to be serendipitous.  In fact, my use of the NYT Chronicle as a yardstick against which to compare the results from the regionally focused (and very limited) newspaper corpus suggests starkly that regional perspective, even in a time of global war,  still mattered significantly in the newspapers of 1944 America. In other words, the regional newspapers complemented — but did not attempt to supplant — the wider global news coverage provided by nationally focused newspapers like the Times. The regional newspapers brought those larger world events into sharp relief by highlighting local participation in, and support for, the war effort. Features highlighting local men “in the fight” dominated most front pages.   These newspapers’ editorial staffs seemingly recognized that both radio and nationally syndicated organs like the Times provided the  broader world perspective while the regionally focused newspapers supplemented that news with more local fare, such as features on the actual local men and women who were contributing to the global effort.  In these cases, only a brief outline of the global news seemed necessary — as evidenced by the differences in text-mining results between the regional corpus and the Times.  These results were a revelation to me, especially since I expected to see little variation in the war’s coverage throughout all newspapers in the U.S.  
Therefore, I must now keep in mind that, when I draw from newspaper sources for my dissertation, I need to maintain a balance in both sampling and perspective between local and national news instruments and the different charters under which each editorial staff functioned. Without Voyant and NYT Chronicle, the significance of this distinction would have been lost on me, and that loss may have jeopardized certain interpretations of my primary sources. I could not have anticipated a better result from my first text-mining effort than this striking realization. Terrific!


Primary Sources

Brookfield Courier, Brookfield, New York, 16 March 1944, Thursday, Vol. 69, no. 8 (Madison County)

Brooklyn Eagle, Brooklyn, New York, 1 April 1944, Saturday, 103rd Year, no. 89 (Kings / Brooklyn County)

Cato Citizen, Cato, New York, 6 January 1944, Thursday, Vol. LI, no. 9 (Cayuga County)

Citizen-Advertiser, Auburn, New York, 1 February 1944, Tuesday, Vol. 13, no. 3,398 (volume and number unclear) (Cayuga County)

Fayetteville Bulletin, Fayetteville, New York, 14 April 1944, Friday, Vol. 55-59, no. 15 (Onondaga County)

Fairport Herald-Mail, Fairport, New York, 6 April 1944, 73rd Year, no. 10 (Monroe County)

Gazette and Farmer’s Journal, Baldwinsville, New York, 3 February 1944, Thursday, Vol. LXXXXVIII, no. 18, Whole no. 2077 (Onondaga County)

Geneva Daily Times, Geneva, New York, 15 March 1944, Vol. 49, no. 243 (Ontario County)

Hilton Record, Hilton, New York, 11 May 1944, Thursday, Vol. 48, no. 6 (Monroe County)

Mexico Independent, Mexico, New York, 13 January 1944, Thursday, Vol. 83, no. 2 (Oswego County)

Nassau Daily Review-Star, Hempstead Town, Long Island, New York, 6 June 1944, Tuesday, Vol. XLVI, no. 133 (Nassau County)

Niagara Falls Gazette, Niagara Falls, New York, 3 May 1944, Wednesday, Vol. LI, no. 41 (Niagara County)

The Otsego Farmer, Cooperstown, New York, 19 May 1944, Vol. LVIII, no. 30 (Otsego County)

Red Creek Herald, Red Creek, New York, 17 February 1944, Thursday, Vol. L, no. 52 (Wayne County)

Red Creek Herald, Red Creek, New York, 9 March 1944, Thursday, Vol. LI, no. 3 (Wayne County)

Schenectady Gazette, Schenectady, New York, 1 January 1944, Saturday Morning, Vol. L, no. 80 (Schenectady County)

St. Lawrence Plaindealer, Canton, New York, 6 June 1944, Tuesday, Vol. 38, no. 56 (volume and number not clear) (St. Lawrence County)

Syracuse Herald-Journal, Syracuse, New York, 6 June 1944 (Second Section) (volume and number unknown) (Onondaga County)


Secondary Sources

Anderson, Benedict. Imagined Communities. London and New York: Verso, 1983.

McDonough, John. “The Longest Night: Broadcasting’s First Invasion.” The American Scholar, Vol. 63, no. 2 (Spring, 1994): 193-211.

Overy, Richard. Why the Allies Won. New York and London: W.W. Norton and Company, 1995.

Weinberg, Gerhard L. A World at Arms: A Global History of World War II (New Edition). Cambridge and New York: Cambridge University Press, 1994, 2005.


Digital Tools

NYT Chronicle

Voyant

Week Thirteen Reading Blog: Are Digital Natives Just Immigrants in Disguise?

Danah Boyd’s skepticism about — and deconstruction of — the terms “digital native” and “digital immigrant”  have done much to strip away the false veneer attending those particularly loaded phrases. I’ve always been skeptical of these characterizations, especially after witnessing so many members of my own generation engage in what one might call “digital innovation.”  Many of the most useful digital tools and software we enjoy today have come from  people like Steve Jobs, Bill Gates, Roy Rosenzweig, and others who grappled in adulthood with the promises that technological advances offered society and the humanities at large.

For me, the distinction between “native” and “immigrant” is not simply generational but instead rests on adeptness of use. Younger people are more patient, more adept users of new media and digital technology while those of us born in the 1950s, 1960s, and early 1970s are less likely to embrace it. Only when the world at large “tests” that new technology and deems it appropriate (and often essential) for everyday use do we “old fogies” embrace those new digital tools. Granted, younger people born in the 1980s and 1990s are not just users; they are also producing the innovative, graphically stunning video games that have flooded the market. But these younger generations don’t quite seem to “own” the market as fully as I had previously imagined; skill with a computer does not equal wisdom with a computer. We label these younger folks as “digital natives,” but they certainly don’t have all the answers. I’ve always considered my son (born in 1985) to be one of those stereotypical “digital natives.” Thus, I routinely deferred to his digital skill in navigating the Internet responsibly while I faithfully fulfilled the role of “digital bumbler.”

But my hands-off attitude changed over time when I began to oversee and question how he was using the Internet and digital tools. I quickly learned that he just jumped on innovations like Facebook and Twitter (and some very violent online games) without questioning what they might or might not do to his online reputation, or to his psyche! His judgment in these cases was not always sound. My initial lack of interest in supervising his use of these online tools stemmed from my deference to him as a “digital native.” In other words, I told myself that “he knows better than I do the dangers inherent in these online digital tools.” Not so. He got himself into a pickle on more than one occasion, and I had to help pull him out. He was an excellent “user” of digital tools, but he used them reflexively and without much forethought.

Thus, Danah Boyd is on target when she claims that the rhetoric attending these terms is not just inaccurate, it’s dangerous.  The “networked world” she mentions is fraught with politicized language and trapdoors that can snare the unsuspecting user — both “native” and “immigrant” alike (Boyd, 197). An inclination to use a digital tool or platform, as I have come to view “digital natives,”  does not automatically make one a discriminating user. Thus, older generations of less adept but interested users (like me) still have a role in guiding the generations growing up with this technology to use it properly and to recognize its potential pitfalls.

We must also keep in mind that many members of the younger generations are turned off by technology due to the numerous pitfalls they have experienced or that await them, such as the posting of unflattering images, online bullying, false information, and the like. Perhaps these online dangers explain why Allison C. Marsh’s museum studies students (supposedly “digital natives”) demonstrated “little interest in the digital world” (Marsh, 279). Her students’ further inability to use a simple program like Omeka to build an online exhibit that flowed logically proved equally disturbing to her. Granted, innovative experiments such as T. Mills Kelly’s “historical hoax” class can help draw reluctant students from the “digital native” generations into a more discerning and responsible use of the Web and other digital tools. However, such an approach is a two-edged sword, primarily because the thrill attending such a hoaxing exercise can result in students who later become “serial Web hoaxers.” In other words, it could create a whole new category of user: the “digital monster.”

What I find truly puzzling, though, is that none of the professors I have had at GMU have employed any digital innovation whatsoever in any of their classes. Of my 10 courses for the Master’s Program, only one course actually demonstrated (and for one class session only) how to use the Web to find online primary sources. Specifically, we spent the class with a librarian explaining to us how we could find newspapers online using ProQuest, and that was it. I find this state of affairs remarkably ironic given Daniel J. Cohen’s admission that digital history is what prompted the Virginia State Council of Higher Education to approve GMU’s PhD in History program as the “PhD with a Difference.” The only class I have taken that involved anything digital is this one (HIST 696, Clio-Wired), and only as part of the PhD program. If GMU is the standard-bearer of digital history, why are the History Department’s faculty members not on board? In my 10 classes for the Master’s Degree program, the traditional model of sitting in a circle to discuss the “monograph of the week” dominated the teaching approach. The digital tools available today allow for excellent visualizations and immersive experiences. I would have loved to have seen some PowerPoint slides with images from the period or even audio or video clips from documentaries and the like to supplement the in-class learning experience. These are cheap and easy tools to use. I run an academic institution for the Army; and, strapped as we are for cash thanks to a gridlocked Congress, we still employ a variety of digital and audio-visual tools to enhance learning, most of which cost nothing. If we are to undergo a true “digital turn” in academic history, then our history professors should set the example, especially in the very institution that prides itself on leading the digital charge.

Steve Rusiecki




Week Twelve Reading Blog: Should History Become a Virtual Commune?

As a historian who has already generated for public consumption two books and a few articles, I have some very stark opinions about when open access to someone’s intellectual property becomes a virtue or a liability. First, I am a firm believer in, and avid supporter of, the Budapest, Bethesda, and Berlin open-access initiatives described by Peter Suber. I support the Creative Commons initiative as well. However, I am also extremely sensitive to the need to protect a scholar’s (or, more specifically, a historian’s) intellectual property from untimely, unauthorized proliferation and, frankly, hijacking by others. Thus, the AHA’s embargoing guidelines for dissertations make good sense to me. Ultimately, some practical factors enter into the equation very quickly when someone has “skin in the game” in the form of a published scholarly work, particularly when that work may represent a lifetime’s investment in travel, research, and intellectual creativity. Dan Cohen is correct when he states that historians, whose non-fiction works are most likely to encounter difficulties in terms of fair-use legalities, need to apply more effort and brain power to figuring out how to straddle more effectively the fence between open access and intellectual protections.

Cohen and Roy Rosenzweig, in their chapter in Digital History (2006) titled “Owning the Past,” are absolutely on target when they explain that existing “copyright law does nothing to protect ideas, only their formal and fixed expression” (Cohen and Rosenzweig, 200). Yet despite this statement, Cohen and Rosenzweig still advocate strongly for open access in the proliferation of “informal” and “unfixed” ideas, a position that I find troubling, even though they advocate for the Creative Commons licenses. Published works, as both scholars admit, clearly enjoy more legal protection and are less prone to intellectual hijacking because the knowledge they contain has already been formally presented to the reading public and to a host of peer reviewers. In other words, stealing ideas from a published work can result in potentially successful legal action and strong policing by the public at large. For the most part, the reasonable person standard applies: no one likes a thief or a copycat and will normally “call out” such transgressors. Recent notable examples include Doris Kearns Goodwin and Stephen Ambrose, both of whom suffered professional embarrassment when caught red-handed quoting from published works without proper documentation. By contrast, unpublished ideas sent streaming through the Internet under the auspices of open peer review or through blogging are likely to invite intellectual “thieves” who prey on that type of unfettered, “un-policed” access. Granted, the Creative Commons licenses offer a potential remedy for any and all online ideas, but I am not yet sold on the viability of such licenses. Have these licenses ever stood up in a court of law?

The scholars most at risk are PhD candidates, who spend years digging for untouched primary sources in order to fill a scholarly gap that, in itself, may have been very difficult to identify. Thus, I fully support the AHA’s guidelines for allowing students to embargo their dissertations for up to six years. This option gives them a fighting chance to get their finalized work “out there” more formally and to enjoy firmer copyright protections. Once a book is published, the knowledge and the ideas are now public. Even Judge Denny Chin’s ruling for Google underscores the importance of these protections. Chin recognized that Google’s open-access efforts with published works, works which already enjoyed copyright protections, provided “significant public benefits” for advancing “the progress of the arts and sciences,” all “without adversely impacting the rights of copyright holders” (Chin, Case 1:05-cv-08136-DC, 26). In short, Chin argued that these works were already protected and that their open-access proliferation on the Web was a benefit, not a liability, to the copyright-holder. I agree fully. As the copyright-holder of two books, and someone protected by these same laws, I have no trepidation about Google making my books part of their open-access project. I wrote the books to share the knowledge, not to hide it. But I would never allow any of my unpublished works to be disseminated in such a way, even with a Creative Commons license, because I know, and have experienced, others preying on the ideas of fellow scholars. Unpublished work has the potential to invite and not deter predators.

The other big problem in spreading ideas openly without proprietary protections is the fact that some academic publishers operating on tenuous business models will be less inclined to publish ideas that have already made the rounds. Rebecca Anne Goetz even recounted that, in 2005, she was led to believe that blogging her ideas online would likely hurt her career prospects. More to the point, William Cronon’s statement that “several” editors from distinguished presses told him that sharing ideas online would affect publication decisions should send shivers down the spines of all PhD students. Maybe, as Adam Crymble suggests, the real problem is over-reliance on the book form itself. But, until History Departments nationwide embark upon a revolution that redefines (or re-imagines) how historians can present their ideas publicly with clear protections while still meeting tenure-track requirements, the book is here to stay.

Overall, open access is a great thing. As Judge Chin stated in his opinion, it gives new life to old books and old ideas. But all ideas need some form of protection before they end up on the open circuit. Thus, the published form offers the most safety to a scholar, particularly because fair-use laws exist on such a sliding scale of interpretation and application. Copyright laws at least serve as a bulwark against full-scale rip-offs. I think all of us historians want our work out there, and we would be happy to see older scholarship engage more with newer scholarship. But none of these things should happen until the scholar is ready to provide that knowledge to the world in a formal, polished manner. The digital, online history world is the new Wild West frontier of academia, and every scholar needs to be armed with a copyrighted “six-shooter” to avoid being exploited or to prevent a lifetime’s worth of work from becoming some lazy schmuck’s magnum opus.

Steve Rusiecki