Best Information: Five-Million-Book Google Database Gets a Workout, and a Debate, in Its First Days

Ngram, Google’s new searchable dataset of words and phrases from 5.2 million published books, got quite a workout on its first day. Within 24 hours of its launch last Thursday afternoon, more than a million queries were run.

Various Web sites have had fun with the new technological toy since its unveiling, running idiosyncratic searches on topics of interest. For example, Tablet magazine focused on Jewish topics. The Atlantic compared “vampire” and “zombie,” and asked whether “pen” is mightier than “sword.” And Jezebel played with terminology about sex and relationships.

The database is, on an enormous scale, the kind of resource that humanities scholars are increasingly using in their research, a trend that is the subject of a New York Times series. And scholars and other interested observers have vigorously debated the reliability of this sort of data, pointing out previous problems with Google Books, including mistakes in dates, misattributed authors and errors in the actual texts as a result of misinterpretations by the automated scanning devices that copy the books.

Geoff Nunberg, a linguist at the University of California, Berkeley, who has been critical of Google Books data, still has his complaints, as he outlined in a Chronicle of Higher Education article. But he conceded that the error rate is much improved in this dataset.

Jean-Baptiste Michel, who designed the database with Google, said by e-mail this weekend that the team recognized that including information with errors was worse than not including it at all, so all books that did not pass strict standards for accurate labeling and scanning were filtered out.

“That is why we end up working with 5.2 million books and not the whole 15 million,” Mr. Michel wrote. (The 15 million figure refers to the number of published books that Google has digitally scanned so far.) “These filtering algorithms took us over a year to improve to our satisfaction. Indeed, if we hadn’t worked on them, we’d have published our very first version of the Ngrams, totally unfiltered, back in 2008.”

Their methodology is explained in detail in the supplemental materials attached to the paper by Mr. Michel and his collaborator, Erez Lieberman Aiden, published in the journal Science.

For their paper, Mr. Michel and Mr. Lieberman Aiden based their research on books published in English from 1800 to 2000. “We do not consider that trajectories outside of English 1800-2000 are scientifically validated,” Mr. Michel wrote. “In particular, before 1800 there are just too few books: one does not have enough statistical power.”

So while you can search back to 1500 on the Ngram database, don’t try using the information you might find to win tenure.

Mr. Lieberman Aiden, who has a Ph.D. in applied mathematics, also addressed the criticism that no humanists were on the research team. “I don’t think this is a very fair criticism,” he wrote in an e-mail on Tuesday. “I studied philosophy at Princeton as an undergrad, got a master’s degree in Jewish history, and actually took a leave of absence from a Ph.D. program in Jewish history when I went to grad school in the sciences (I did not return).

“Two of our other authors, Joseph Pickett (Ph.D., English language and literature, University of Michigan) and Dale Hoiberg (Ph.D., Chinese literature, University of Chicago), are the executive editor of the American Heritage Dictionary and the editor in chief of the Encyclopedia Britannica, respectively; although not academics, they are certainly humanists of profound influence whose expertise directly bears on the contents of the paper,” he added. “Furthermore, we spoke with dozens of other humanists throughout the development of the project, as can be seen in our acknowledgments.”

You can read more about the researchers’ work at www.culturomics.org.


Wednesday, December 22, 2010
