Queen Victoria's journals, the work of a single person, amount to over sixty million words. By comparison, the 1611 Authorized Translation of the Bible has about 785 thousand words, and the seven books of the Harry Potter series total about 1.1 million. So imagine a close-reading purist specializing in the history of the British Empire in the Victorian era. Even at a brisk pace, that scholar would need thousands of hours for an initial reading of this single source, without one pause to jot down a note.
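The scale is easy to check with back-of-the-envelope arithmetic. Here is a minimal sketch; the reading speed of 250 words per minute is my own assumption (a common rough figure for adult silent reading), and the word counts are the rounded ones quoted above, so the results are only estimates.

```python
# Rough reading-time estimate. WORDS_PER_MINUTE is an assumed value,
# not a measured one; change it to see how the estimate moves.
WORDS_PER_MINUTE = 250


def reading_hours(word_count, wpm=WORDS_PER_MINUTE):
    """Hours of uninterrupted reading needed for word_count words."""
    return word_count / wpm / 60


# Rounded word counts as quoted in this post.
corpora = {
    "Queen Victoria's journals": 60_000_000,
    "1611 Authorized Translation": 785_000,
    "Harry Potter series": 1_100_000,
}

for name, words in corpora.items():
    print(f"{name}: about {reading_hours(words):,.0f} hours")
```

Under that assumption the journals alone take thousands of hours, which is years of full-time work, while the entire Harry Potter series fits comfortably inside two working weeks.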
All of this underscores the practical value of machine-powered distant reading. Take the example above, in which the Google Ngram Viewer charts how often the phrases “social equality” and “racial equality” appear, as a share of all two-word phrases in the books printed each year from 1800 onward. From 1920 onward, the relationship between these two distinct ideas is unmistakable. It is also stimulating to consider possible explanations for the peaks and troughs. Intuitively, the cataclysm of World War II seems to have played a role in a dramatic increase in discussion of social and racial equality. The trough between 1945 and 1953, followed by a wave that peaked in 1972, matches almost perfectly the grand narrative of the Short Civil Rights Movement.
Data mining is enormously valuable, but at present its immediately recognizable uses are limited to heuristics (a way to discover lines of research) and stimulating visualization (a way to illustrate a point). I’m a traditionalist and have been involved in critical scholarship for more than a decade, so the following is going to sound, well, overly critical. That said, I want to declare that I feel extremely glad to be a student of history at this moment in history, because we are in the process of significantly increasing our power to make sense of the past. Still, fairness requires that some issues be acknowledged.
As a begrudging student of Jacques Derrida and Michel Foucault, I must point out the complexity of language. Perhaps the best illustration of this is the simple fact that I am treating a topic that has been much discussed by tens of thousands, if not millions, of people in the English-speaking world, and yet every single sentence in this blog post is unique. Not one of them, placed within quotation marks, can be found using Google, except on this page itself once the search spiders have found it. So, to take the example above, social and racial equality may be expressed in innumerable ways: circumspect language, synonyms, ironic expression, and slang all distort the measurement of the ideas that the two phrases I searched for represent. Have Americans thought and written less and less about liberty since 1800, or have they simply preferred to call it freedom?
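To make the synonym problem concrete, here is a minimal sketch of how the measurement shifts when a concept is tracked through a group of terms instead of a single word. Everything in it is hypothetical: `share_mentioning` is a helper I made up for illustration, and the "corpus" is four invented sentences, not historical data. It also does nothing about irony, slang, or circumlocution, which no word list will catch.

```python
import re


def share_mentioning(docs_by_year, terms):
    """For each year, return the fraction of documents that contain
    at least one of the given terms (whole-word, case-insensitive)."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]
    shares = {}
    for year, docs in sorted(docs_by_year.items()):
        hits = sum(1 for doc in docs if any(p.search(doc) for p in patterns))
        shares[year] = hits / len(docs) if docs else 0.0
    return shares


# Invented placeholder sentences, NOT real historical texts.
toy_corpus = {
    1800: ["Liberty is the birthright of all.", "The harvest was poor this year."],
    1900: ["Freedom of the press matters.", "They spoke of liberty and freedom."],
}

print(share_mentioning(toy_corpus, ["liberty"]))
print(share_mentioning(toy_corpus, ["liberty", "freedom"]))
```

In this toy case, searching for "liberty" alone suggests no change between the two years, while searching for either word shows the idea becoming more common. The word list itself is an interpretive choice, which is exactly the kind of judgment the linguistic turn warns us not to hide.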
Optical character recognition will get better, but undoubtedly whole wars have been missed because a computer thought they were wans. I’m honestly not too worried about that, because the distortions are probably statistically uniform across our various samples. My biggest concern is with the samples themselves.
Google proudly declares that it has scanned more than 25 million books. Before the Ngram analysis can be a source of reliable insights for me, I need to know more about these 25 million books. How does the high proportion of scientific journals, about which many humanists have raised concerns, affect the sample? Is the sample geographically concentrated in particular areas beyond what population density would predict? These are all potentially major issues.
Distant reading already empowers us, and will continue to empower us, to understand our past, particularly the past century, but we need to keep the insights of the linguistic turn in mind as we realize that potential.