Wednesday, September 18, 2019

Data Mining



Queen Victoria's journals, the work of a single person, amount to over sixty million words. By comparison, the 1611 Authorized Translation of the Bible has 785 thousand words, and the seven books of the Harry Potter series total 1.1 million words. So imagine a close-reading purist specializing in the history of the British Empire in the Victorian era. That scholar would need thousands of hours for an initial reading, not counting a single pause to jot down a note, and that is for a single source.
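The scale is easy to check on the back of an envelope. The sketch below assumes a steady reading pace of 250 words per minute, which is my own round number rather than anything measured, and the function name is invented for illustration.

```python
# Back-of-the-envelope reading time.
# The reading speed is an assumption, not a measured figure.
WORDS_PER_MINUTE = 250  # assumed steady pace for an adult reader

# Word counts as quoted in the post.
corpora = {
    "Queen Victoria's journals": 60_000_000,
    "1611 Authorized Translation": 785_000,
    "Harry Potter series": 1_100_000,
}


def reading_hours(words, wpm=WORDS_PER_MINUTE):
    """Hours of uninterrupted reading for a text of the given length."""
    return words / wpm / 60


for title, words in corpora.items():
    print(f"{title}: about {reading_hours(words):,.0f} hours")
```

At that assumed pace the journals alone come to about 4,000 hours, against roughly 52 for the Bible and 73 for all of Harry Potter. A slower or faster reader changes the totals but not the gap between them.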


All of this underscores the practical value of machine-powered distant reading. Take the example above, in which the Google Ngram Viewer charts how often the phrases “social equality” and “racial equality” appear in books published each year since 1800, as a share of all the two-word phrases in that year's scanned books. From 1920 onward, the relationship between these two distinct ideas is unmistakable. It is also super stimulating to consider the theoretical explanations for the peaks and troughs. Intuitively, the cataclysm of World War II seems to have played a role in a dramatic increase in discussion of social and racial equality. The trough between 1945 and 1953, followed by a wave that peaked in 1972, maps near-perfectly onto the grand narrative of the Short Civil Rights Movement.

Data mining is super valuable, but at present its immediately recognizable uses are limited to heuristics (a way to discover lines of research) and stimulating visualization (a way to illustrate a point). I’m a traditionalist and have been involved in critical scholarship for more than a decade, so the following is going to sound, well, overly critical. That said, I want to declare that I feel extremely glad to be a student of history at this moment in history, because we are in the process of significantly increasing our power to make sense of the past. Still, equanimity requires some issues to be acknowledged.

As a begrudging student of Jacques Derrida and Michel Foucault, I must point out the complexity of language. Perhaps the best illustration of this is the simple fact that I am treating a topic that has been much discussed by tens of thousands, if not millions, of people in the English-speaking world, and yet every single sentence in this blog post is unique. Not a single sentence placed within quotation marks can be found using Google, apart from this post itself once the algorithmic spiders have found this page. So, to take the example above, social and racial equality may be expressed in innumerable ways: circumspect language, synonyms, ironic expression, and slang all distort the measurement of the ideas that the two phrases I searched for represent. Have Americans thought and written less and less about liberty since 1800, or have they simply preferred to call it freedom?
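The liberty-versus-freedom problem can be shown in miniature. The sketch below uses four sentences I made up and tagged with made-up years, so it is not historical data, and the helper `counts_by_year` is a hypothetical name. It counts one word on its own, then counts either of two near-synonyms.

```python
import re
from collections import Counter

# Invented sentences tagged with invented years; not real historical data.
toy_corpus = [
    (1800, "Liberty is the birthright of every citizen."),
    (1800, "They fought for liberty and for their homes."),
    (1900, "Freedom is the birthright of every citizen."),
    (1900, "They cherished liberty but spoke of freedom."),
]


def counts_by_year(corpus, terms):
    """Count whole-word, case-insensitive matches of any term, per year."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    totals = Counter()
    for year, text in corpus:
        totals[year] += len(pattern.findall(text))
    return dict(totals)


liberty_only = counts_by_year(toy_corpus, ["liberty"])
either_word = counts_by_year(toy_corpus, ["liberty", "freedom"])

print("liberty only:      ", liberty_only)
print("liberty or freedom:", either_word)
```

Searching for "liberty" alone makes the idea look as if it is fading in the toy corpus, while counting both words shows it growing. Real usage is far messier than two interchangeable words, which is exactly the point: the list of terms is an interpretive choice, not a neutral measurement.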




Optical character recognition will get better, but undoubtedly whole wars have been missed because a computer thought they were wans. I’m honestly not too worried about that, because the distortions are probably statistically uniform across our various samples. My biggest concern is with the samples themselves.

Google proudly declares that it has scanned more than 25 million books. Before the Ngram analysis can be a source of reliable insights for me, I need to know more about these 25 million books. Many humanists have raised concerns about the high proportion of scientific journals: how does that affect the sample? Is the sample geographically concentrated in particular areas, beyond what population density alone would explain? All of these are potentially major issues.

Distant reading has already empowered us, and will continue to empower us, to understand our past, particularly the past century, but we need to keep the insights of the linguistic turn in mind as we realize that potential.
