Text Mining: Taking the Good with the Bad

Upon reading a preview of Professor Allen Romano’s work with text mining in Greek poetry, I appreciated not just his insight into patterns within ancient texts, but also his sense of where analytical tools like text mining belong in scholarship. Examples from his own work show how scholars can combine text mining techniques with qualitative study to see texts in a new way.

Earlier this semester, the digital scholars group discussed Google Ngrams. My initial reaction to Google’s tool was just that: it provides a new way to examine materials we may or may not already be familiar with. The tool, which “analyzes” Google Books, is controversial because some argue that “distant reading” of texts won’t allow for accurate analyses. The researchers who released a study alongside Ngrams’ debut had their paper published in Science. This didn’t sit well with humanists, who may have seen the overly quantitative methods and the coining of the term “culturomics” as arrogant.

But should tools like Ngrams be dismissed so easily? Perhaps humanists are just put off by the limitations and supposed arrogance of Google’s tool, and they haven’t fully considered how text mining tools can be useful in conjunction with qualitative research. Romano does a good job of describing text mining. He explains that it is an extension of analyses scholars already perform (only much faster), and that the patterns it reveals may not provide answers but can give rise to new questions.

Of course, I’m sure there are plenty of scholars like Romano who are already familiar with text mining, and maybe they don’t find Google Ngrams so offensive. They probably already see the opportunities that Google Ngrams could provide in conjunction with good old-fashioned knowledge of a text. But I wouldn’t blame anyone for being disappointed by the limitations. For example, the search works best when books published after the year 2000 are excluded, and because of copyright issues, you can’t feasibly examine the contexts from which patterns emerge.

Granted, Romano had more control over his situation, because he studied only Greek poetry and had access to the original text of his materials, but I think he does a good job of taking the limitations of text mining in stride. He describes it as a fresh way to view familiar works, a technique where nothing is presumed. It is different from, say, browsing a digital archive with a question in mind.

Particularly interesting was the way Romano repeatedly framed the use of text mining. He says that text mining is not so much about finding rare cases of word use as about finding patterns that emerge from the use of common words. He also says that while scholars may already be able to guess what the results of a text mining study will be, it is seeing the markers provided by computers that is most helpful in viewing the text in a new light.

This last part reminds me of Google Ngrams again, and of how some might criticize it because it appears to provide more definitive answers than it actually does. But I think it comes down to how you use the tool, and how well you apply the results to your current knowledge. If Romano had accepted a shallow analysis of his text, he might be positing that some prominent Greek heroes are actually women in disguise. Instead, in combination with further qualitative study, he can make more reasonable assertions; text mining helps scholars pinpoint places where, for instance, male Greek characters take on female speech characteristics. From there, they can compare several scenes across the corpus and examine what gave rise to those patterns and why they’re important.

Maybe we should all take a hint from Romano’s research and similar text mining applications, and start seeing Google Ngrams as a fun, potentially helpful tool that helps us see books in a new way, rather than a tool that will spit out answers about the history of the English language.


