Book-scraping

By Amanda on February 9th, 2009

Via several of my Twitter contacts: The Times Online has developed Book Scraper, a literary text analysis tool with 126 books in its database so far, from the 16th through the early 20th centuries. You can look at word clouds and lists of unique and particularly long words for each text (check out the long word list for Ulysses); you can compare two texts and see how much vocabulary they share, accompanied by a Venn diagram; and you can search for individual words and see graphs of how often they appear and in which texts.
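For the curious, here is a rough sketch of what a two-text vocabulary comparison might look like. I have no idea how Book Scraper actually does it; the function names (`vocabulary`, `compare_texts`), the crude tokenizing rule, and the overlap score are all my own assumptions, not anything taken from the tool.

```python
import re

def vocabulary(text):
    """Return the set of distinct lowercase words in a text.

    Assumption: a "word" is a run of letters, optionally followed by an
    apostrophe and more letters. Real texts need more careful handling.
    """
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

def compare_texts(text_a, text_b):
    """Split two vocabularies into the three regions of a Venn diagram."""
    vocab_a = vocabulary(text_a)
    vocab_b = vocabulary(text_b)
    shared = vocab_a & vocab_b
    combined = vocab_a | vocab_b
    return {
        "only_a": vocab_a - vocab_b,
        "shared": shared,
        "only_b": vocab_b - vocab_a,
        # Share of the combined vocabulary that both texts use.
        "overlap": len(shared) / len(combined) if combined else 0.0,
    }

result = compare_texts("The cat sat", "the dog sat")
print(sorted(result["shared"]), result["overlap"])
```

The three sets correspond to the left, middle, and right regions of the Venn diagram. Stage directions and older spellings would inflate the "only" regions unless they were cleaned up first.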

It has its flaws. As one friend commented on Twitter: "More books! And where's the API?" And the text analysis isn't perfect: it's clear from the Shakespeare page that stage directions and older spellings affect the statistics somewhat. But I like the word graphs, even though they're a bit skewed by the relatively small sample of texts in the database. Look at the graph for "amiable": a smattering of early 17th-century uses, mostly Shakespeare; then a set of giant bubbles from early- to mid-19th-century novels, with the heaviest concentrations in Jane Austen. I once wrote a short paper for an undergraduate Austen class on the word "amiable" in Emma. The Book Scraper graph is an interesting confirmation of how often that word shows up, not only in Austen, but in her near contemporaries as well.
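One way to make a word graph like this less sensitive to the length of each book is to count occurrences per 10,000 words rather than in raw totals. This is only a sketch of that idea: I don't know what Book Scraper plots, and the function name `rate_per_10k` and the tokenizing rule are my own assumptions.

```python
import re
from collections import Counter

def rate_per_10k(text, word):
    """Occurrences of `word` per 10,000 words of `text`.

    Assumption: words are runs of letters with an optional internal
    apostrophe; matching is case-insensitive and exact (no stemming).
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return 10_000 * Counter(tokens)[word.lower()] / len(tokens)

# Toy input, not a real passage from any novel.
print(rate_per_10k("amiable and amiable again", "amiable"))
```

Normalizing this way would let a short play and a long novel be compared on the same scale, though it does nothing about the small number of texts in the database.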

It would be nice if the text set were big enough to do some higher-end data mining. But it's intriguing to see this kind of thing being done for an audience outside of academia. I wonder whether this is a sign that literary data mining is going mainstream.

Categorized under: Books, Weblogs.

lime tree bower dot net

Amanda L. Watson's home page
