Project Blog: A Peek Behind the Curtain When Using Voyant and NYT Chronicle

My experience with the digital text-mining tools Voyant and NYT Chronicle not only proved extremely productive but also helped me see the primary sources I intend to use for my dissertation (newspapers from January to June 1944) in an entirely new light. As I stated in the conclusion to my project, I learned that newspapers with a regional, or local, focus functioned in 1944 under a different charter from nationally focused newspapers such as The New York Times. My preliminary analysis of several national and local newspapers before undertaking this project had given me the impression that local papers distilled, through the Associated Press and other syndicates, the same quality and content of reporting that papers like the Times offered. I was wrong. My text-mining project completely changed my way of thinking about newspapers of that period. If I am to argue that the news media contributed to a singular collective memory of D-Day at the time the event was occurring on 6 June 1944, then I must do so by recognizing the uniqueness of the local newspapers’ contribution to that memory. I simply cannot approach the national corpus of newspapers at the time as one homogeneous body of information.

What I thought would be the most limiting factor of my text-mining effort, a regionally focused and somewhat limited newspaper corpus from New York state, turned out to be a godsend. Because newspapers from 1944 are still under copyright, I could locate only one website, the Fulton Library, that offered OCR versions of newspapers from that time period. Unfortunately, the scattershot cataloging of the newspapers on the site made locating specific editions very tedious and often impossible. I intended to build a corpus of five newspapers per month from January to June 1944, but my difficulty in locating specific editions on the site forced me to settle for three per month, for a total of 18 newspapers, a much smaller corpus than I wanted and expected. This corpus would ultimately support my endeavors with Voyant. The PDF scans of the newspapers were, for the most part, fairly good and clear, but I could not save them as any other type of file for embedding in WordPress. The hyperlink below provides an average example of the PDF quality, obviously taken from microfilmed versions of the original newspapers.

Red Creek NY Herald 1944-1946 – 0034

The real challenge came with correcting the OCR. Hoo boy! Correcting OCR took more than half the time I invested in the entire project, which I wanted to create solely in WordPress as a “born digital” product. When I used the right-click “Select Tool” to grab the news articles from the scans and save them in Word as “uploadable” text that Voyant would accept, I ran into multiple challenges. First, many of the articles that sat side by side in columns on the original newspaper collapsed into each other and became entangled. These ‘entanglements’ occurred about 40 percent of the time. Sixty percent of the time, though, the OCR appeared like the example below:

Lt. Alice L. Wood Tells
Of Arrival in India
Glimpses of Foreign Life Portrayed in Excerpts of
Letters Written Relatives Back Home
ito Player Turns Score
in Less Than Minute
to Play
> M B l u e D e v i l s j o u r n e y e d to
i l c o t t Monday n i g h t t o p l a y in
: s t a t e s e c t i o n a l b a s k e t b a l l p l a y -
T h e first g a m e o n t h e court
s Red Creek v s . Macedon, C
tools. The final score w a s 38
F o u r p e r s o n s w e r e i n j u r e d Mond
a y , Mar. 6, a t 1 2 : 2 0 p . ra. w h en
a c o a c h o p e r a t e d by L e o n ScoTille,
2 4 , of S o d u s v e n t o u t o f control
i n f r o n t o f ,/the F r a n k H e a g l e f a rm
l o c a t e d o n e m i l e e a s t o f W o l e o tt
v i l l a g e o n R o u t e 104.
P a s s e n g e r s in t h e S c o v i l l e car
w e r e M a u r i c e R o s s , 3 2 , o f Spring
G r e e n road, Woleott; Edgar
Ward, 53, of Red Creek and
G e o r g e G a l l o w a y o f S p r i n g Green
r o a d , W o l e o t t.
The ear was t r a v e l i n g east
G a l l o w a y e s c a p e d w i t h a f e w mino
r body bruises.
T r o o p e r Jack D o y l e o f t h e W o l e
o t t s u b – s t a t i o n o f t h e N e w York
s t a t e p o l i c e w a s i n c h a r g e of the
i n v e s t i g a t i o n .
Nation United in Raising
$200,000,000 to Carry
on War Job
2 1 i n M a e e d o n ‘ s f a v o r , ^^ ^ _
The C-M T r i – C o u n t y rf»*»PK>n when^it w e n t ” o f f t h e ‘ h l g h w a y on
“!L ^ J ^ ^ – J ? ? ^ a 5 S £ l t h e n o r t h « d e o f t h e ‘ r o a d , knoek-
- _.—.. ,„ . ^ down a mail b o x a n d g o i n g a
d i s t a n c e of 170 f e e t i n t h e field
p a r a l l e l w i t h t h e road. The mac
h i n e r o l l e d over five t i m e s bef
o r e i t s t o p p e d u p r i g h t f a c i n g t he
h i g h w a y .
R o s s w a s t h r o w n 30 f e e t bef
o r e t h e car t u r n e d o v e r t h e l a st
t i m e . He r e c e i v e d a f r a c t u r ed
c o l l a r bone and d e e p l a c e r a t i o ns
o v e r t h e l e f t e y e . Becker’s amb
u l a n c e f r o m Red Creek r e m o v ed
h i m t o Meyers h o s p i t a l , Sodus,
w h e r e h e r e c e i v e d t r e a t m e n t.
S c o v i l l e , t h e d r i v e r , r e c e i v e d a
i y n e c o u n t y c h a m p i o n s , b o t h i n!
u s B. The g a m e s t a r t e d off
t h a b a n g a n d a n y o n e w h o k n ew
y t h i n g a b o u t b a s k e t b a l l could
J i t w o u l d be a c l o s e a n d h a r d -
o g h t g a m e . House o f t h e Cato
u e D e v i l s scored t h e first two
i n t s o f t h e g a m e , b u t C l y d e irae
d i a t e l y came back w i t h two
ore p o i n t s . The g a m e c o n t i n u ed
t h e s a m e t e n s e back a n d f o r th
r u g g l e t h r o u g h o u t t h e first quarr.
At t h e e n d o f t h e first quarr
C-M l e d , 9 t o 7. In t h e s e c -
id q u a r t e r t h e C a t o B l u e D e v i ls
id a s e r i e s oo rf fl aa ss tt ssuuccccee ss safuuil ^ ^ ^ i ^ i e f t s h o u l d e r . Ward

Granted, the OCR was not so bad that I had to start from scratch, but correcting the words so that Voyant would recognize them became an incredibly time-consuming task.  Each newspaper generated an average of 40 pages of text in Word.  Whew.
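For readers facing a similar cleanup, one way to decide where to spend correction time is to flag the letter-spaced lines automatically before opening the text in Word. The following is only a minimal sketch in Python, not the process I used: the function names and the 0.6 threshold are my own assumptions for illustration. It does not repair anything, because once every letter is separated by a space the original word boundaries are lost and a human still has to restore them.

```python
def single_char_ratio(line: str) -> float:
    """Share of whitespace-separated tokens that are exactly one character long."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def flag_letter_spaced(lines, threshold=0.6):
    """Return (line_number, text) pairs whose single-character share meets the threshold.

    The threshold is an assumed starting point, not a tested value.
    """
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if single_char_ratio(text) >= threshold
    ]


# A few lines copied from the OCR sample above.
sample = [
    "Lt. Alice L. Wood Tells",
    "F o u r p e r s o n s w e r e i n j u r e d Mond",
    "Ward, 53, of Red Creek and",
    "i n v e s t i g a t i o n .",
]

flagged = flag_letter_spaced(sample)
for number, text in flagged:
    print(f"line {number} needs manual correction: {text}")
```

A list like this would at least separate the lines that came through cleanly from those that need retyping, which is where most of the 40 pages per newspaper went.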

Once I finished correcting the OCR, my text-mining efforts with Voyant proved to be exciting and intriguing.  Voyant was easy to use, and the Word Cloud, Corpus Reader, Word Trends graph, and other features functioned smoothly. I encountered some glitches with the Word Trends graph when trying to input my “saved” terms into one graph for comparison purposes. I could not get my favorites list to come up, but I did not really need to use this feature for my project, so I left it alone. The live embedding feature worked extremely well in WordPress. During my project, Voyant’s creators pushed out an update that changed which feature to select for embedding, but I was able to figure it out quickly.  I became so engaged in text-mining a variety of words that I had to step back and firmly define the boundaries of my project before everything became too unwieldy.

NYT Chronicle was equally easy to use and very enlightening. I used NYT Chronicle as a comparison tool, checking my text-mining results from the Voyant corpus of local newspapers against a source with a national news focus, the NYT. This comparison provided the epiphany I mentioned in my introduction to this blog: regionally focused newspapers were radically, not slightly, different from nationally focused newspapers. The ability to select actual versus relative numbers of the words in use also proved helpful. Likewise, the ability to click on different points of the distribution graph to reveal a list of the articles in which the selected word appeared was superb. The only problem was that the list began with 31 December of each year, so if I wanted to see an article from earlier in the year, I had to click through scores of pages sequentially to reach it. Very tedious. In one case, I encountered a glitch with the word “Normandy.” NYT Chronicle would not allow me to view articles including that word earlier than July 1944. I tried a few times on separate days and got the same result. Yet this glitch did little to dampen my enthusiasm for what the tool produced. Like Voyant, it was an excellent and eye-opening tool that did much to change my view of my primary sources.
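The distinction between actual and relative numbers matters most when comparing sources of very different size, such as a small local corpus and the NYT. The Python sketch below illustrates the idea under a stated assumption: it treats "relative" as a term's share of all words in a text, which may not be how NYT Chronicle itself calculates its relative figure. The two sample strings are invented placeholders, not text from any newspaper, and the function name is hypothetical.

```python
import re
from collections import Counter


def term_counts(text: str, term: str):
    """Return (actual count, share of all words) for a term in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    actual = Counter(words)[term.lower()]
    relative = actual / len(words) if words else 0.0
    return actual, relative


# Invented placeholder texts of different lengths (not real newspaper copy).
short_doc = "invasion news today"
long_doc = "invasion reports and more invasion reports fill the long front page today"

for label, doc in (("short", short_doc), ("long", long_doc)):
    actual, relative = term_counts(doc, "invasion")
    print(f"{label}: actual={actual}, relative={relative:.3f}")
```

In this toy case the longer text uses the word more often in actual terms but less often in relative terms, which is the kind of difference that switching between the two views can expose.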

Overall, my experiences with Voyant and NYT Chronicle were extraordinarily positive. I tend to be suspicious about technology while still being open-minded about it, and I am happy to proclaim that these two tools did not disappoint me. I intend to use them well into the future. I certainly want to stay abreast of updates to these tools and re-assess how I can use the results in the future and in what context. Good stuff all around.

Steve Rusiecki