Week Three Practicum Blog: OCR and My Exercise in Frustration

Good Lord. Until this practicum, I thought I was fairly adept at manipulating computer programs. But to make parts of this exercise work, I felt like the proverbial one-legged man in an ass-kicking contest.  I feel more traumatized now than when I experienced ground combat for the first time. Just kidding — maybe. First, let me address my experience with the Google Drive OCR program.

I will admit that once I was able to open the older version of Google Drive OCR, the program proved rather easy to use. I uploaded my assigned document from Prof. Robertson’s file (number 7) and ran it through the program. The instructions were easy to follow and worked perfectly. But I encountered two significant problems: one was a practical computing issue (how to save the JPEG I created to my desktop), and the other was the nature of the OCR results. First, let me address the results. As you can see from the image below (sorry, but I could not figure out how to align it in the WordPress program), the typewritten script is very clear. Despite my efforts to crop the image and increase the font size, I could not make the typewritten words any clearer than they appeared in their original form.

[Image: scan of the original typewritten document]

Next came the results. Wow. The digitized text following remediation appeared to be written in Klingon, Pig Latin, or some other unintelligible jib-jab.  I expected a far better result given the clarity of the typewritten page. Here’s what the program gave me:

ma charge of the translation. I admit Burns r0001″! money. I admit ho return! some monoy, and 1t Isl returned been“. by all and”.1 30 air, I didn’t.

A No sir. I think I got my information from tho audit department. I could explain to you how tint cam.

HR. BOARDMAR: is hero and be tho but proof of that. I think I will sustain the objection.

Q In your discussion with reference to thin lunatigaticn that yen wore making was there anything othor than innltigatm the birth of this 011116?

A Yu, than m the minus back of the pushing of the 0mg” against Thu. That m about all, an! to fin! out who :11 nor. inhrutoa Ln pronoun“; Thaw uni that their mutt?“ wort.

Q D16 you in flat connutlcn make my innitigatlon to “M out in or around I!” York in” how Ml “cap. I” uooonplilhoa and anything in commotion with um and how they In” trying to 301: Mm back?

As you can see, very few sentences proved coherent. In fact, without reading the original primary-source document, I could barely discern the topic of the interrogatory. I even manipulated the text after I imported it into the blog so that it would at least resemble the original in terms of format. The error rate was well above 70 percent in this case, which shocked me. The digitized newspapers archived on the “Chronicling America” Web site fared much better, but I will discuss those results below. An error rate this high suggests that the OCR-generated text is marginally useful at best, and then perhaps only for indexing. Frankly, my confidence in OCR dropped markedly after this exercise. I was unable to compare my results with Andrew’s because he stated that his computer could not support the exercise.

The second problem I encountered with Google Drive OCR was trying to save the JPEG version of my OCR results to my desktop or to some other location that would allow me to import it into my practicum blog. I could neither drag nor copy it to the desktop from Google Drive. In fact, I could not locate a “save as” feature anywhere. Eventually, I copied the image and text results into a Word document and then used that file to import the image and text into WordPress. Once again, I struggled to perform what should have been a couple of routine computer tasks because I could not intuit the functions within the program.
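For what it is worth, the same conversion can be done entirely on one’s own machine, which sidesteps the whole “save as” hunt. Below is a minimal sketch using the open-source Tesseract engine through the pytesseract Python library; this is an alternative I have not tried in the practicum itself, and the filename is only a placeholder for my document number 7.

# Minimal local OCR sketch using Tesseract via pytesseract (an alternative to
# Google Drive OCR, not the tool used in this practicum).
from PIL import Image
import pytesseract

image = Image.open("document_7.jpg")       # placeholder filename for the scanned page
text = pytesseract.image_to_string(image)  # run OCR on the image

# Save the recognized text locally, no hunting for a "save as" feature required.
with open("document_7_ocr.txt", "w", encoding="utf-8") as out:
    out.write(text)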

In my review of the newspapers in “Chronicling America,” I chose to search for 19th-century American newspapers that mentioned the phrase “Lincoln assassinated.” I set my parameters for 1865 to 1865 (one year only), and I received numerous hits from both Northern and Southern newspapers beginning in April 1865. I carefully reviewed the PDF scans of the original newspapers and the OCR-generated text behind them. Unlike my experience with Google Drive’s OCR results, the error rate was much, much lower. Like the document I used above, the newspaper text was generally clear save for a handful of letters and words that appeared smudged or that the type had not struck firmly enough during the original printing process. I engaged in some back-of-the-napkin quantitative analysis with a few paragraphs in each newspaper: after counting both correctly and incorrectly captured words, I calculated an error rate of roughly 12 incorrect words for every 100 scanned. This 12-percent error rate was much more manageable and gave me greater confidence that the newspapers would yield useful, accurate results for historians engaging in data mining, topic modeling, or indexing. Many of the incorrect words resulted from poor printing quality in the original; in one case, the tightness of the letters in relation to one another changed the word “bouquet” to “bonqnet.”
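For anyone who wants to reproduce that back-of-the-napkin check, the arithmetic is simple enough to script. Here is a small sketch of the calculation I describe above; the word counts in it are made up purely for illustration.

def ocr_error_rate(incorrect_words, total_words):
    """Return the error rate as incorrect words per 100 scanned words."""
    return 100.0 * incorrect_words / total_words

# Hypothetical hand counts from a few sample paragraphs (illustrative only).
sample_counts = {
    "paper_A": (14, 120),
    "paper_B": (11, 95),
    "paper_C": (12, 100),
}

for paper, (bad, total) in sample_counts.items():
    print(f"{paper}: {ocr_error_rate(bad, total):.1f} errors per 100 words")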

Since most of the primary sources I will use for my dissertation will be U.S. newspapers from June 1944, I looked for an online archive other than “Chronicling America” that digitized newspapers using OCR. Newspapers.com fit the bill, and I signed up for a free seven-day trial so that I could gain full access to the archive. I conducted an initial search for “6 June 1944” and immediately saw scores of digitized front pages from D-Day. The thumbnails made scanning for examples extremely easy. I selected three front pages to test for OCR quality: the Daily Boston Globe, the Las Cruces Sun-News, and the Bakersfield Californian. The scan quality varied significantly from paper to paper, most likely because of the quality of the microfilm from which the newspapers were digitized. For example, the image quality of the Daily Boston Globe was so poor that I was unable to read it even after zooming in as close as the program would allow. By contrast, the other two papers displayed much better digitized quality. After about 30 minutes of effort, I could not figure out how to view the OCR text behind each front page. Therefore, I decided to test the OCR quality by searching for a word that appeared frequently on each front page: “France.” Even though I could make out numerous uses of the word “France” in the Boston paper, the search highlighted only one “hit,” clearly a result of the digitized original’s poor quality. The Bakersfield and Las Cruces papers fared much, much better. In each case, the search feature “hit” every instance of “France” that I could discern – a perfect record!
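The keyword test lends itself to the same sort of rough arithmetic: compare how many instances of a word the search highlights against how many I can actually see on the page. A quick sketch of that comparison follows; the counts other than the single Boston “hit” are hypothetical, since I never found a way to view the raw OCR text behind the Newspapers.com pages.

def search_recall(hits_highlighted, instances_visible):
    """Fraction of the visible instances of a word that the OCR-backed search finds."""
    return hits_highlighted / instances_visible if instances_visible else 0.0

# Hypothetical counts for the word "France" on each front page (illustrative only;
# the Boston paper's single highlighted hit is the one figure taken from my test).
pages = {
    "Daily Boston Globe": (1, 9),
    "Las Cruces Sun-News": (6, 6),
    "Bakersfield Californian": (8, 8),
}

for paper, (hits, visible) in pages.items():
    print(f"{paper}: search recall for 'France' = {search_recall(hits, visible):.0%}")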

In the end, my experience with “Chronicling America” and Newspapers.com restored my faith in OCR. I am still not certain why the Google Drive OCR program produced such poor results. Was I simply too inept to use it properly? Should I have somehow tried to improve the source image’s quality? Perhaps some skill is necessary to ensure the proper digitization of documents using OCR. What is most important to me now is knowing that OCR-digitized newspaper archives exist that will prove both useful and reliable as I press on with my dissertation research. In the meantime, I need to get on the stick and figure out how to use this software properly. I wish we had had more time in class for the practicum so that we could have learned from each other’s mistakes.

Steve Rusiecki
