A graph showing the frequency of the use of the words "robot" and "android" in the HathiTrust corpus.

Exploring with Words: Working with HathiTrust data workshop

Yesterday I had the pleasure of attending the workshop on Working with HathiTrust data, led by Digital Fellow Param Ajmera. In case you missed it, you can find an article about HathiTrust on the Digital Fellows blog, Tagging the Tower.

HathiTrust is a huge digital library that contains over 17 million volumes, which makes it particularly good for large-scale text analysis. The advantage of using HathiTrust, as Param showed us, is that you can perform the text analysis on the website of the HathiTrust Research Center (HTRC), a cloud computing infrastructure that allows us to parse a large amount of text without crashing our computers. And the best part – it’s free and you can create an account with your CUNY login.

Param gave us a live tutorial on how to select the texts we want from HathiTrust and create a collection that we can save and open on HTRC. He created a collection of public papers of American presidents and used topic modeling to find the words that are most commonly used together in these texts. The result was a list of “topics”, groups of words that the algorithm gathered according to how often they appear together. This allowed us to compare different presidents according to the keywords in their public papers. The experience was very interesting – and with perfect timing!
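To get a feel for what “words that appear together” means, here is a minimal Python sketch. It is not the topic modeling algorithm that HTRC runs, just a simple co-occurrence count that illustrates the intuition behind it. The four toy documents and the function name `cooccurrence_counts` are invented for this example and are not taken from the workshop.

```python
from collections import Counter
from itertools import combinations

# Toy documents invented for illustration; not taken from HathiTrust.
documents = [
    "congress budget tax economy",
    "war troops peace treaty",
    "tax economy jobs budget",
    "peace treaty war allies",
]


def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for doc in docs:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(documents)

# Show the word pairs that share the most documents.
for pair, n in pair_counts.most_common(5):
    print(pair, n)
```

Pairs like “budget” and “tax” show up together in more than one document, while “budget” and “war” never do. A real topic model goes much further than this, but the groups it produces rest on the same kind of signal.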

The part I enjoyed the most, however, was when Param taught us how to use Bookworm, a tool that creates visualizations of language trends over the entire corpus of HathiTrust. The result is very similar to the Google Ngram Viewer, but Bookworm has one advantage: when you click on a point on the line, you can see a list of the texts where the word appears.
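For readers curious about what a trend line like this measures, here is a small Python sketch that computes the relative frequency of a word per year. It is only an assumption about the general idea, not Bookworm's actual implementation, and the `records` data and the `relative_frequency` function are hypothetical names made up for the example.

```python
from collections import defaultdict

# Hypothetical (year, text) records, invented for illustration.
records = [
    (1950, "the robot walked and the robot spoke"),
    (1950, "an android dreamed"),
    (1960, "the android and the robot met"),
]


def relative_frequency(records, word):
    """Return {year: occurrences of word / total words in that year}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}


robot_trend = relative_frequency(records, "robot")
android_trend = relative_frequency(records, "android")
print(robot_trend)
print(android_trend)
```

Dividing by the total number of words in each year is what makes years with very different amounts of text comparable, which is why the curves show relative rather than absolute frequency.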

Since topic modeling can take a long time (hours or even days) depending on the volume of text you’re working with, I decided to experiment with Bookworm. Here are my Ngrams:

  1. Being a sci-fi lover, I decided to check the frequency of the words “robot” and “android”. I was initially surprised to see that “android” had such a low curve compared to “robot”. When I checked the texts that were used for the Ngram, I realized that the word “robot” appears in a lot of papers related to engineering, robotics, and information technology, while “android” is a term mostly used in sci-fi. If we look at the curve of “android” alone, we see that the word has a spike in the 1960s. Was it because of Philip K. Dick? Or Star Trek?
  2. Inspired by the Rocky Horror Picture Show and the TV show “Pose”, I decided to investigate the relative frequency of “transvestite”, “transsexual”, and “transgender” in the HathiTrust corpus. The first two terms sound pretty dated (and rightfully so), while the third one is the most commonly used now. As you can see from the graph, the use of the term “transgender” skyrockets starting in the mid-1980s, beating the other two at the end of the 1990s. Looking at the texts, I also noticed that:
    1. “transvestite” is mostly used to describe a cultural phenomenon (for example in texts about literature, theater, cinema, or fashion)
    1. “transsexual” is used in medical contexts, for example in papers about gender dysphoria
    1. “transgender” is used in medical contexts, beating “transsexual” at the end of the 1990s. However, it is also used in institutional contexts like policies, guidelines, and social justice reforms.