I came across a thoughtful article the other day in the Los Angeles Review of Books called “Literature is not Data: Against the Digital Humanities” by Stephen Marche. The digital humanities is something else entirely from the efforts to program computers to write books. It is a new and evolving field, a sort of catch-all for a bunch of different humanities subjects that have gone digital.

Wikipedia has a pretty good digital humanities entry with lots of links to explore for those who are interested. But briefly, digital humanities includes things like digital libraries and archives such as the awesome Walt Whitman Archive and the Perseus Digital Library. It also includes amazing multi-media projects like The Valley of the Shadow, which closely examines two communities during the American Civil War. In addition, some digital humanities researchers use computational methods to analyze large data sets (aka: texts that have been digitized). It is this last approach that Marche is most concerned about in his article.

Marche pretty much blames Google for making the digital humanities possible. It all began, he says, in 2002 with Google figuring out the fastest, most efficient way to scan print books. It isn’t all Google’s fault, though. Marche also blames literary institutions for being so deeply conservative that they allowed Google to take control. He charges, “For at least 50 years, humanities departments have been in the business of creating problems rather than solving them.” Ouch.

Marche does acknowledge that Google was not the first to start digitizing texts. Early English Books Online (EEBO) has been available for a decade. Far from seeing that as a good thing, though, Marche sees it as a decline:

That wonderful database in its own way demonstrates how digitization leads to the decline of the sacred. Before EEBO arrived, every English scholar of the Renaissance had to spend time at the Bodleian library in Oxford; that’s where one found one’s material. But actually finding the material was only a part of the process of attending the Bodleian, where connections were made at the mother university in the land of the mother tongue. Professors were relics; they had snuffboxes and passed them to the right after dinner, because port is passed left. EEBO ended all that, because the merely practical reason for attending the Bodleian was no longer justifiable when the texts were all available online.

Calling it the “decline of the sacred” seems like hyperbole to me. What exactly was sacred? The books, or passing the port and the snuffbox? I am sure the Bodleian still has plenty of visiting scholars. What having the texts available online means is that those who could not previously afford to visit in person can now examine them. Depending on your purpose, online viewing might be perfectly sufficient.

What Marche is really upset about though is the data mining aspect of some digital humanities research:

Data mining is potentially transformative, more for its shift in attitude than for any actual insight it has generated. Some of its lexigraphical generalizations have been remarkably astute as philology, establishing scalable n-grams of word sequences over time. The problem comes when these generalizations are applied to literary questions proper.

But really, digital humanities researchers are not the only ones who apply generalizations to literary questions. Overgeneralizing is plain bad scholarship no matter what methodology is used.

Still, that’s not the heart of the problem. Marche asserts:

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

This is true: literature is not data. But some aspects of literature can be treated as data, as with the n-grams he mentions in the earlier quote.
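
For anyone curious what an n-gram looks like in practice, here is a minimal sketch in Python. It is my own toy illustration, not how any particular research project does it: the function name ngram_counts and the simple word-splitting rule are assumptions I made for the example, and the sample text is just the opening of Whitman's "Song of Myself."

```python
from collections import Counter
import re

def ngram_counts(text, n=2):
    """Count every run of n consecutive words in text (hypothetical helper)."""
    # Crude tokenizing: lowercase everything, keep letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Line up the word list against shifted copies of itself to form n-grams.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

sample = "I celebrate myself, and sing myself, and what I assume you shall assume"
counts = ngram_counts(sample, 2)

# Show the most frequent two-word sequences in the sample.
print(counts.most_common(3))
```

All this does is tally which word pairs occur and how often. Deciding whether a repeated pair means anything is still left to the reader.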

Now, I am not a digital humanities scholar and I just flirt with the field around the edges, so I am no expert. I do know, however, that the field is not just about literature. It is called digital humanities, and that includes history and art and music and dance and drama, among other things. Literature is just a small part of the field. And when I peruse digital humanities sites and journals that focus on literature, I have yet to see very many attempts at using data to interpret a text as a person writing criticism or theory might. And that is what Marche is most worried about: computers theorizing about meaning.

His argument that computers can never do this rests on the literary record being incomplete and messy. That doesn’t seem to be a problem for scholars, so I am not sure how pointing out that there are nine different versions of Shakespeare’s Richard III counts against using computers. If anything, I would think that this is a perfect example of how computers might help scholars study all nine versions. Computers are much better at comparing and contrasting changes in a text across a set of texts, and better at tracking word and phrase usage. Computers can do this much faster and with fewer errors than people can. But the computer doesn’t decide what the results mean; people decide what the resulting data reveals.
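
To give a sense of the kind of comparing I mean, here is a small Python sketch using the standard difflib module. It is only an illustration under my own assumptions: the helper name word_changes is made up, and the two "versions" are lines I invented for the example, not real variants from any edition.

```python
import difflib

def word_changes(old, new):
    """List (operation, old_words, new_words) wherever two versions differ."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # skip the stretches where the versions agree
    ]

# Invented lines for illustration only, not real textual variants.
version_a = "the quiet river runs beside the mill"
version_b = "the quiet stream runs beside the old mill"

for change in word_changes(version_a, version_b):
    print(change)
```

The program only reports where the wording changed. Why a word changed, and whether the change matters, is a question for the scholar.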

Yes, one of the dangers of data mining (for lack of a better expression) is, as Marche worries, a loss of context. I am sure as the field expands there will be problems with context and a host of other things that haven’t even arisen yet. But that doesn’t mean that what is being done is completely useless. It only means that scholars must be careful and watch for errors creeping into their research. This, to me, seems like something all good researchers are concerned with and is not exclusive to the field of digital humanities.

And now, after spending all this time going through the article, I have no idea what the author was trying to do other than worry about what the digital humanities might do to literature in a worst-case scenario. That worst-case scenario, computers doing literary analysis, is not likely to happen. This is not to say that someone won’t try it, but it seems rather like the computer writing poetry from my last post. It might have something going for it, but it will never have the understanding and nuance of human intelligence.

As for being able to access books, especially old books and manuscripts, online at any time from anywhere, bring it on! I am not a scholar, but maybe I love Walt Whitman so much that I want to spend some time comparing his written manuscript of a poem with all of its other iterations. Thanks to the Walt Whitman Archive I can do that without having to travel all over the country to different libraries, where I would probably not be allowed to see some of the stuff anyway since I am not affiliated with a university. I don’t need snuffboxes and port, and I bet in this time of tight budgets a good many academic researchers don’t either.

In case you haven’t figured it out, I find the digital humanities a fascinating field and I look forward to seeing how it develops. I expect, like any other field or method of research, there will be both good and bad things about it. Yes, we should talk about the bad things so we are aware of potential pitfalls. But focusing only on the possible negatives and blowing them up into big monsters does nobody any good. Mistakes will be made but so will discoveries.