Working with HathiTrust Data Workshop

This past Thursday, October 22nd, 2020, I had the opportunity to attend the Working with HathiTrust Data workshop offered by Param Ajmera. On the surface, HathiTrust most closely resembles an online historical archive or library. The workshop delved deeper into the applications of the HathiTrust digital library as a tool for text analysis and visualization. I had some familiarity with the different uses of historical archives and with how they retrieve text through keyword searches, which the first portion of the workshop demonstrated, so I was unsure exactly what set HathiTrust apart from traditional historical archives.

Built on cloud computing infrastructure, HathiTrust, like most digital archives, can retrieve text in bulk; it can then analyze that text in order to create visualizations based on topic modeling. Topic modeling, as explained in the workshop, proceeds in steps: first you create a workset from your input terms, then the retrieved texts are chunked into documents, and finally those documents are used to create a visualization based on word usage. It is worth mentioning that HathiTrust relies on APIs (Application Programming Interfaces) to retrieve and sort texts using your input terms, and the results are then organized and visualized by a program called Bookworm. A clear example used during the workshop compared the ideological content of three different presidents' speeches. Using the input terms “nationalism” and “internationalism,” we were able to see, on a line graph, the ideological prevalence of nationalism and internationalism over the course of 240 years (1760-2000) and the presidencies within that time period, with most of the discernible data falling in the final 120 years (1880-2000). The final part of the workshop gave a sense of the amount of information available on HathiTrust: millions of titles are available through its partnerships with many different academic and research institutions.
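
To make the steps above more concrete, here is a minimal Python sketch of the general idea: chunk each text in a workset into documents, then measure how often each input term appears per year, which is the kind of number a line graph like the one in the workshop would plot. This is only an illustration under stated assumptions. It does not call HathiTrust's APIs or Bookworm, it counts words rather than fitting a real topic model, and the workset text and the function names chunk_text and term_share_by_year are invented for the example.

```python
import re
from collections import Counter


def chunk_text(text, chunk_size=5):
    """Split a text into 'documents' of at most chunk_size words each."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def term_share_by_year(workset, terms):
    """Return {year: {term: share of all words in that year's text}}."""
    result = {}
    for year, text in sorted(workset.items()):
        chunks = chunk_text(text)
        # Count every word across all chunks for this year.
        counts = Counter(word for chunk in chunks for word in chunk)
        total = sum(counts.values())
        result[year] = {
            term: (counts[term] / total if total else 0.0) for term in terms
        }
    return result


# Hypothetical toy workset: invented sentences, not real HathiTrust data.
workset = {
    1900: "nationalism shapes the nation and nationalism grows",
    1950: "internationalism and cooperation replace old nationalism",
}

shares = term_share_by_year(workset, ["nationalism", "internationalism"])
for year, row in shares.items():
    print(year, row)
```

Plotting each term's share against the year would give a rough, small-scale analogue of the nationalism and internationalism graph shown in the workshop.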

I believe that most of HathiTrust’s utility derives from its streamlined process for analyzing word usage and from the ease of use that comes from coupling Bookworm with text analysis. The workshop described this process in a clear, easy-to-understand way that reaffirmed not only the importance of historical review but also how easily it blends with digital visualization. There is a trove of information available through topic modeling, and it is invaluable to the study of history. I believe, however, that this information is only accessible to the wider academic audience through clear, presentable visualization techniques, as the texts to be gathered and the hits on certain words can sometimes total in the trillions. HathiTrust’s APIs and Bookworm are two solutions to this problem that I am excited to see adapt with technology.