Week Five Practicum Blog: The Power of Text Mining

My initial foray into text mining with Google’s Ngram Viewer proved rather exciting. For the first time, I was able to generate a highly useful visualization of one topic’s frequency over time from a large online corpus: Google Books. I appreciated the program’s accessibility and ease of use for the average user. Most importantly, the distribution graph that the viewer generated was easy to interpret and made sense at first glance.

I selected as my search topic the phrase “invasion of Europe” for the 100 years between 1900 and 2000. I chose English as the preferred language of my book corpus and “3” as the smoothing value. The viewer instantly generated an easy-to-follow distribution graph that clearly showed the expected spike in frequency for the phrase “invasion of Europe” from 1940 to 1944. The values on the left showed an initial low frequency of use followed by a remarkable five-fold jump during those World War II years. The viewer even tracked two variations of the phrase: one with the word “Invasion” capitalized and the other with all letters in the phrase capitalized. Each of these variations had a separate line on the graph that sat well below the frequency of my initial search term, most likely because the variations captured by the Ngram Viewer reflected the less frequent use of the phrase as a title, while my version, with the lowercase “i,” suggested greater use in the content of the books themselves.
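To make the smoothing setting less of a black box, here is a minimal Python sketch of what a smoothing value of 3 does as I understand it: each year's value is replaced by the average of that year and the three years on either side. The yearly numbers below are invented for illustration (they are not real Ngram data), and the way the ends of the series are handled (averaging whatever neighbours exist) is my own assumption.

```python
def smooth(values, smoothing=3):
    """Centered moving average.

    Each point is averaged with up to `smoothing` neighbours on each side.
    At the ends of the series, only the neighbours that exist are used
    (an assumption made for this sketch).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical yearly frequencies for 1938-1946 (invented numbers).
raw = [1.0, 1.0, 2.0, 3.0, 5.0, 6.0, 5.0, 2.0, 1.0]

for year, before, after in zip(range(1938, 1947), raw, smooth(raw, 3)):
    print(year, before, round(after, 2))
```

A smoothing of 0 would return the raw values unchanged, so a higher setting trades sharp single-year peaks for a cleaner overall trend.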

The Viewer’s most useful feature was the ability to move the mouse over the distribution graph and view the actual frequency numbers at specific points in time. Below the graph, hyperlinks for year groups (such as 1940-1944) led me directly to the online documents reflected in the numbers that produced the graph. The graphing tool and the accompanying date links made sorting the relevant texts from the irrelevant ones rather easy. The results came up as thumbnails of the covers, which made for quick scrolling and recognition. Since my topic focuses on the invasion of Europe in the context of World War II, I was able to locate original digital scans of Life magazines from 1940 to 1944 quickly while moving past “false hits,” such as a book about the 1853 Turkish invasion of Europe or a 1903 economics treatise chronicling the “commercial” invasion of Europe at that time.

Overall, this foray into text mining using the Ngram Viewer was very productive for me. The results of my first search are as follows:

Next, I sampled Bookworm using the “Chronicling America” corpus. As with the Ngram Viewer, I found this program very easy to use and navigate. Unfortunately, due to copyright issues, the corpus of newspapers in this database does not extend beyond 1920. Thus, it falls outside the time period that interests me, 1939 to 1945. And, unlike Ngram, Bookworm will work only with a single word and not a phrase, a significant limitation for me. In any case, I decided to use the term “invasion” to see what it would yield between the available years of 1840 and 1920. The results are at the following link: Bookworm Chart.

As with Ngram, the distribution graph came up quickly and was easy for me to read and interpret. Remarkably, I identified several spikes for the word “invasion” on the graph, which plots publication year on the x-axis against words per million on the y-axis. As with Ngram, I could move the mouse along the distribution graph and see boxes briefly describing the articles per year that represented “hits” for my text search. But what I found most useful was that I could click on the graph and go directly to the OCR version of the newspaper that registered the “hit.” Although the earlier 19th-century newspapers were difficult to read without an extreme close-up view, they all had a small arrow icon that popped up near the margin to direct me to the line or lines where the word “invasion” was mentioned. Very cool.

I decided to check the three biggest spikes (words per million) against the newspaper publication years to see what was actually happening that required journalists to use the word “invasion.”  The greatest spike was for 1840, and the context used for “invasion” centered on discussions of America’s various militias and those militias’ Constitutional role as defenders against invasion. The next spike came in 1861 in the context of the North’s invasion of the South during the Civil War, and the final big spike came in 1898 during the Spanish-American War and the U.S. invasion of Cuba.
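For anyone wondering why the graph reports words per million rather than raw counts, here is a small Python sketch of the arithmetic. All of the counts are invented for illustration (I do not have Bookworm's underlying numbers), and `per_million` is just a helper name I made up.

```python
def per_million(hits, total_words):
    """Occurrences of a word per one million words of text."""
    return hits * 1_000_000 / total_words


# Hypothetical data: year -> (occurrences of "invasion", total words that year).
counts = {
    1840: (90, 1_000_000),
    1861: (160, 2_000_000),
    1880: (60, 3_000_000),
    1898: (280, 4_000_000),
}

rates = {year: per_million(hits, total) for year, (hits, total) in counts.items()}

# The three years with the highest normalized rate, biggest spike first.
top_three = sorted(rates, key=rates.get, reverse=True)[:3]
print(top_three)
```

In this made-up data, 1898 has the most raw hits but 1840 has the highest rate, because far fewer words were printed that year. That is the reason a normalized measure makes years with very different amounts of surviving newsprint comparable.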

Once again, I was very pleased with how this program functioned and with the graph it produced.  I would have found the program more useful if I could have searched with word pairs, thus perhaps narrowing my search even further. Using more than one word in Bookworm automatically creates a flat-line result on the graph. The results of my tinkering with Bookworm appear at the following hyperlink:  http://bookworm.culturomics.org/ChronAm/#?%7B%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22invasion%22%5D%7D%5D%7D
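Out of curiosity, the long string of percent signs at the end of that Bookworm link can be decoded with a few lines of standard-library Python. This is only a sketch of how to read the link; it says nothing certain about how Bookworm itself processes the query.

```python
import json
from urllib.parse import unquote

url = (
    "http://bookworm.culturomics.org/ChronAm/"
    "#?%7B%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22invasion%22%5D%7D%5D%7D"
)

# Everything after "#?" is a percent-encoded JSON description of the search.
fragment = url.split("#?", 1)[1]
query = json.loads(unquote(fragment))

print(query)
print(query["search_limits"][0]["word"])
```

The decoded query stores the search term as a list under the key `word`. Whether that list can hold a multi-word phrase is not something the link reveals, which fits my experience of getting a flat line whenever I tried more than one word.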

The third text-mining viewer I sampled was the NYT Chronicle, which included all newspaper editions of The New York Times from just before the Civil War up through 2010. Once again, I found this software easy to use and the rapidly generated distribution graph easy to understand. And, as with the Google Ngram Viewer, I could search the NYT corpus using my complete ‘phrase of choice,’ “invasion of Europe.” The graph produced the expected spike over the war years, but the fact that the spike extended out to 1952 suggests that D-Day was a topic of discussion in the Times more than eight years after the invasion occurred. In the context of my research, this remarkable bit of evidence demands some scrutiny. As the results suggest, the invasion seemingly took on such a powerfully iconic image in the minds of all Americans that it potentially came to embody all that was good about America (sacrifice and justice in the face of an evil adversary) and thus remained a topic worth emphasizing to the Times’ readers well after the invasion and the war. Granted, that assertion is a significant leap based on what essentially is a ‘distant reading’ of the texts, but the result really sparked my analytical imagination. Text mining clearly has possibilities for my research.

In addition to the excellent distribution graph, I appreciated the scroll-over technique reminiscent of Ngram and Bookworm, but I really liked the direct link from the graph to scanned OCR versions of the newspapers’ original pages. Additionally, the scroll-over feature on the distribution graph provided the percentage of articles per year, a feature not available in Bookworm’s “Chronicling America” (as far as I could tell). Yet I experienced the most difficulty with this program when attempting to access specific copies of newspapers and when trying to save my graph results for importing into the blog. First, I was not able to figure out how to get past the pay-wall for accessing the newspapers. Granted, GMU allows such access, but I could not find a way to log in and obtain it. Next, I had a heck of a time trying to save an image of my graph. My computer’s Screenshot feature only allowed me to save the file in Microsoft Memo or some other goofy program. Finally, I gave up and just grabbed the link as follows: http://chronicle.nytlabs.com/?keyword=abolition.invasion%20of%20Europe. It actually worked when I exited my browser (Firefox) and pasted it into another browser (Internet Explorer). Go figure.

My final experiment was with Voyant, which would not allow me to upload plain-text files through Internet Explorer. Very frustrating. But I took Prof. Robertson’s advice, downloaded Firefox, and set it as my default browser. After that change, everything worked like a charm. But I must admit that I found this program to be highly confusing at first blush. I uploaded the “magazine” and Oscar Wilde’s “novel” from Prof. Robertson’s Dropbox and, given that Wilde had a penchant for flowers (a small point I recalled from my English literature days), I decided to mine both documents (the novel was actually Wilde’s The Picture of Dorian Gray) for the word “rose.” I typed “rose” into the search box below the text corpus, and it produced color-coded search results in the left margin of the magazine (upper level) and novel (lower level). But only in one or two cases did I find the word highlighted. For both documents combined, the Words in the Entire Corpus feature showed 38 hits for “rose” and another 22 for “roses,” a surprisingly small result given that one of the metaphors Wilde often used for the effects of a decadent lifestyle was that of the withering rose (or at least some other type of flower).

But when scrolling along the colorized search “hits” beside the text, I could not see any digital enablers (save for an occasional highlighted “rose”) that led me quickly to the “hits” for “rose.” I had to squint my way through several pages before finally finding the identified “hit.” I thought the screen with the Summary title and its attendant word-frequency analysis for oft-used words in the two documents was extremely interesting. Likewise, the “Words in the Entire Corpus” window, which made selecting and exploring specific words possible at a click, was quite remarkable. At first, the “word cloud” in the Cirrus window did not make any sense to me, until I scrolled over selected words and found the same frequency-of-use data available in the other windows I described.

Although the Cirrus feature’s word cloud was interesting as an art form, I found it to be a bit over the top. The Word Trends feature was very useful; the distribution graph clearly showed that the frequency of the word “rose,” in whatever context it was used (verb or noun), trended higher in the novel than in the magazine.

Likewise, the Words in Documents feature separated the hits for “rose” by document (23 in the novel and 15 in the magazine),  a feature that would certainly figure highly in my own research.
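The per-document counts that Voyant reports can be approximated with a short Python sketch. The two text snippets below are stand-ins I invented (not the actual magazine or novel), and `count_terms` is a hypothetical helper, not anything Voyant exposes.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count exact, case-insensitive matches of each term in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    tally = Counter(words)
    return {term: tally[term] for term in terms}


# Invented sample texts standing in for the two uploaded documents.
docs = {
    "novel": "The rose withered. He rose slowly, and the roses fell.",
    "magazine": "A single rose lay on the table.",
}

per_doc = {name: count_terms(text, ["rose", "roses"]) for name, text in docs.items()}
total_rose = sum(counts["rose"] for counts in per_doc.values())

print(per_doc)
print(total_rose)
```

Note that a simple count like this lumps together “rose” the flower and “rose” the verb, just as my Voyant search did. Separating the two would require part-of-speech tagging, which is beyond this sketch.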

After toying with Voyant for several hours over two days, I did not feel that it was as user-friendly as Ngram or NYT Chronicle. Frankly, I was not able to keep up with the in-class explanation of Voyant because I was too busy wrestling with the “upload” feature.  Therefore, I went online and printed a “Getting Started” guide that helped me understand what the different features in Voyant were telling me and how they functioned.  The guide proved very useful, and led me to experiment with pasting a URL for a corpus source into Voyant and seeing what happened. I linked the 1 October 1914 issue of the Arizona Republican newspaper from “Chronicling America” and hit “reveal.” The results were very good; the text populated the corpus feature very well, and my sample search for “Germans” (World War I was in full swing when this edition was published) worked nicely.

My standing concern (call it a “fear”) is that the newspaper databases that I want to use for my class project, specifically papers dating from 1940 to 1944, are behind pay-walls, are password protected, or are filtered in some way that won’t make them accessible via this URL feature. Instead, I will have to locate Web sites with the newspapers I need and “snatch” individual examples in order to create my own corpus. I find that prospect to be very daunting. I tested this approach by grabbing three OCR versions of newspaper snippets from ProQuest’s “Historical Newspapers” database and uploading them as PDF files. The results were poor. The text was garbled and unclear, and the various features in Voyant simply registered that gibberish. I’m not sure that the quality of the OCR was the reason, but my confidence level in building my own newspaper corpus for Voyant is all but nil.

Ultimately, Voyant offers much more than the other text-mining programs, but I just need to learn how to use all of the viewer’s features and find ways that it can help me in my research.  Using Voyant with the databases I need seems to be the biggest challenge I have to overcome.

Steve Rusiecki

