Week Five Reading Blog: The Power of Text Mining and Topic Modeling

Since this course began, I have found myself reading ahead in an effort to identify the digital tools that might prove most useful to my research. The two techniques that excited me most were text mining and topic modeling, each of which suggested ways of negotiating large swaths of newspapers related to D-Day and the months leading up to the invasion without spending days, possibly months, sifting through irrelevant online or on-site archives. These tools promise to help me compile relevant sources quickly, but I also recognize that mining and modeling cannot supplant my need to read every relevant primary source in its original form, as an OCR image or otherwise, in order to evaluate its content and context properly. I was equally excited to learn that these tools can produce graphs and histograms that add visual impact to any assertions I might make from a quantitative perspective. Once again, however, I am mindful of the strong thematic thread running through this week's readings and those of other weeks: know the shortcomings of these tools and proceed with caution.

The assertion made by Frederick W. Gibbs and Daniel J. Cohen that a tool like text mining can open "gateways to further exploration rather than [serving as] conclusive evidence" resonates strongly with me (Gibbs and Cohen, 74). Like these two historians, I don't see the results of a text or topic search standing alone as evidence. In some cases, the quantification of hits can help make broader points, as in Cameron Blevins's foray into defining regional spatiality in certain newspapers. His extensive text-mining efforts allowed him to argue that newspapers "privilege . . . certain places over others," thus creating an "imagined geography" for their readers (Blevins, 124). The visual mapping he generated proved equally impressive and useful in driving home his point. But Blevins was careful to point out that the traditional skills of the historian, namely close reading and contextual source evaluation, have never been more necessary as a way to avoid falling into the superficiality trap. In my case, text mining will prove useful for determining the degree to which newspapers featured articles on key Allied leaders in the run-up to the invasion or, once the invasion began on 6 June 1944, privileged coverage of the Allied efforts in the West over those of the Soviets in the East. The benefits of these searches are twofold: quantification on one hand and, on the other, the ability to read closely for context and meaning only those editions that yielded "hits." But I also agree with Ted Underwood that a basic search, which is how he seems to define text mining, will generally give you what you expected to find. What about finding those unknown "gems"? I think the second tool, topic modeling, will help me in that endeavor.
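Before turning to topic modeling, here is a minimal sketch of the kind of keyword counting I have in mind, written in Python. Everything in it is my own assumption rather than anything prescribed in the readings: the folder of OCR-derived text files, the one-file-per-edition layout, and the short list of leader names are all placeholders.

```python
from collections import Counter
from pathlib import Path
import re

# Hypothetical layout: one OCR'd plain-text file per newspaper edition
LEADERS = ["eisenhower", "montgomery", "bradley", "patton"]

counts = Counter()
flagged = []  # editions with at least one hit, queued for close reading

for path in Path("newspapers/1944").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    edition_hits = {name: len(re.findall(rf"\b{name}\b", text)) for name in LEADERS}
    if any(edition_hits.values()):
        flagged.append(path)
        counts.update({name: n for name, n in edition_hits.items() if n})

print(counts.most_common())                      # the quantification
print(len(flagged), "editions flagged for close reading")
```

The flagged list captures exactly the twofold benefit described above: the counts supply the quantification, and the flagged editions are the ones I would then read closely in their original form.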

I found Robert K. Nelson's "Mining the Dispatch" to be a remarkably effective discussion of the possibilities, and potential pitfalls, of topic modeling, which he defines as a "probabilistic technique to uncover categories and discover patterns in and among texts." What excites me about this technique is that grouping words as a way to search for macro-patterns within specific topic areas can lead to new discoveries. For example, in the context of my research, the editorials addressing specific aspects of D-Day in the months leading up to the invasion cover many different topics, most of which would be difficult to narrow down and categorize by hand. Topic modeling seems to offer a way of focusing more quickly on the numerous themes embedded in opinion columns, editorials, and other articles. But topic modeling also seems to have the most pitfalls. Micki Kaufman suggests that it requires a specific skill set that most historians lack. If Kaufman is correct, then many historians who struggle to master such skills may avoid the tool altogether or grossly misinterpret their results. Or, as Ted Underwood suggests, if programming skills become necessary because the project's scope is too great, many historians may opt out. Frankly, after reviewing the graphic portrayals of Kaufman's analysis of Kissinger's Memcons and Telcons using the topic-modeling software MALLET, I had difficulty making sense of the images and what they were trying to tell me. They supposedly portrayed meaningful word correlations, but, in the absence of further explication, they simply puzzled me. I guess my mind just doesn't work that way. Nelson's approach seemed more in line with what is possible given my own skills, particularly the sorting of machine-generated topics into written categories. Topic modeling is one digital tool I want to try for my project, but I'm a bit apprehensive about what the results may yield and about my ability to interpret them properly.
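To give myself a feel for what the technique actually does before committing to it, here is a small sketch of LDA, the algorithm behind tools like MALLET, using Python's scikit-learn library rather than MALLET itself. The three toy "editorials," the choice of three topics, and every parameter value are invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: in practice, each string would be one OCR'd editorial
docs = [
    "invasion channel coast landing craft buildup",
    "eisenhower supreme commander allied headquarters",
    "soviet offensive eastern front red army advance",
]

# Turn the documents into a word-count matrix, dropping common stopwords
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit an LDA model; the number of topics is a judgment call the historian makes
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

# Print the top words per topic; these clusters are what must then be
# labeled and interpreted by hand, as Nelson does with the Dispatch
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```

Even this toy version makes visible what reassures me about Nelson's approach: the software only emits clusters of co-occurring words, and naming those clusters in written categories remains the historian's job.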

Overall, text mining and topic modeling intrigue me greatly. The only thing to do now is put them to use. But some big questions remain. How will I apply software such as MALLET or the Google Ngram Viewer to the newspaper databases found on Newspapers.com, Chronicling America, or ProQuest? The more I consider the technical aspects of putting these tools into practice, the more uneasy I feel. But isn't getting out of one's "comfort zone" the path to bigger and better discoveries?
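As a tentative first answer to my own question: MALLET cannot query those databases directly, so the texts would first have to be downloaded from them as plain-text files. Assuming that step is done, the workflow the standard MALLET tutorials describe comes down to two commands, which I sketch here wrapped in Python; the install path, folder names, and topic count are all hypothetical.

```python
import subprocess

# Hypothetical paths: an unpacked MALLET install, and a folder holding
# one plain-text file per newspaper edition downloaded from the databases
MALLET = "mallet-2.0.8/bin/mallet"

# Step 1: import the folder of text files into MALLET's binary format
subprocess.run([
    MALLET, "import-dir",
    "--input", "newspapers/1944",
    "--output", "dday.mallet",
    "--keep-sequence",
    "--remove-stopwords",
], check=True)

# Step 2: train the topic model and write human-readable outputs
subprocess.run([
    MALLET, "train-topics",
    "--input", "dday.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "dday_keys.txt",         # top words for each topic
    "--output-doc-topics", "dday_composition.txt",  # topic mix per edition
], check=True)
```

The outputs are plain text: one file lists the top words for each topic, the other gives each edition's topic mixture, which is where the close reading would begin.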

Steve Rusiecki
