Text Mining Praxis Assignment: Fanon’s “Black Skin, White Masks” in French vs. English

For my text mining assignment, I wanted to see what would happen if I separately input the same book in two different languages. In particular, I wanted to see if Voyant could capture or visualize any translation decisions or “glitches in translation” that may come up when a text is translated from one language to another: Would some words appear more often in one language than in the other? Would some words not translate clearly? Can translation decisions be captured and understood clearly through a tool like Voyant?

I chose to input PDFs of Frantz Fanon’s 1952 book Peau Noire, Masques Blancs (translated into English as Black Skin, White Masks) and its 1986 English translation by Charles Lam Markmann. Both versions were downloaded from Monoskop. For both texts, I set the word clouds to show only the top 125 words.

From inputting the two versions into Voyant, the most noticeable result was a metadata issue: Voyant indexed “_black_skins_white_masks” from the English version, causing that “word” to take up a lot of space in the word cloud. Underscores were an issue for a few terms in the English version’s word cloud, as was the inclusion of Fanon’s name as a term. These, I presume, are pieces of metadata hidden throughout the PDF in ways that I cannot trace easily through a simple “ctrl+find” in my Preview application.
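One way to keep this kind of debris out of the counts is to clean the extracted text before handing it to a tool. The following is a minimal sketch, not anything Voyant does internally: the function name `clean_tokens`, the sample string, and the choice to treat any token containing an underscore as file-name residue are all assumptions made for illustration.

```python
import re
from collections import Counter

def clean_tokens(text, stop_terms=()):
    """Lowercase and tokenize text, dropping tokens that look like PDF debris.

    Assumption: any token containing an underscore is file-name residue,
    and stop_terms lists extra words (e.g. a running-header name) to ignore.
    """
    tokens = re.findall(r"[\w']+", text.lower())
    stop = {t.lower() for t in stop_terms}
    return [t for t in tokens if "_" not in t and t not in stop]

# Made-up sample standing in for text extracted from the PDF.
sample = "_black_skins_white_masks Fanon The black man is a man _black_skins_white_masks"
counts = Counter(clean_tokens(sample, stop_terms=["fanon"]))
print(counts.most_common(3))
```

Dropping the author's name this way would also remove places where the name legitimately appears in the body text, so it is a trade-off rather than a clean fix.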

In regard to the actual terms, there were clear discrepancies in the frequency of term usage between the English and French versions of the text (at least to the eye of someone who doesn’t know French). To name a few examples: “noir” was used 339 times, while “black” was used 357 times; “nègre” was used 399 times, while “Negro” was used 436 times; “blanc” was used 289 times, while “white” was used 504 times; and “l’homme” was used 94 times, while “man” was used 423 times. On the one hand, as someone who does not know French, I recognize that there may be other French words used in place of, say, “man” than just “l’homme”, which may explain the large difference in usage between the two terms.
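To make the size of these gaps easier to compare, the counts above can be turned into simple English-to-French ratios. This is only a rough sketch using the numbers reported in this post; it does not normalize by the total length of each text (those totals are not recorded here), and it does not add up inflected French forms such as feminine or plural variants, which a fuller comparison would need to do.

```python
# (French term, English term): (French count, English count), as reported above.
pairs = {
    ("noir", "black"): (339, 357),
    ("nègre", "Negro"): (399, 436),
    ("blanc", "white"): (289, 504),
    ("l’homme", "man"): (94, 423),
}

# How many times the English term appears for each appearance of the French term.
ratios = {
    english: round(en_count / fr_count, 2)
    for (french, english), (fr_count, en_count) in pairs.items()
}

for english, ratio in ratios.items():
    print(f"{english}: {ratio}x")
# black: 1.05x
# Negro: 1.09x
# white: 1.74x
# man: 4.5x
```

Seen this way, “black” and “Negro” track their French counterparts closely, while “white” and especially “man” diverge sharply, which is where single-form counting is most likely to be misleading.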

On the other hand, what is made clear through the word clouds is that specific decisions are made in the act of translation that show the non-linearity and non-neutrality of the very act. While this is a relatively obvious claim that has been made time and again, it was interesting to see it happen in front of my own eyes. Additionally, it’s interesting to think about how my understanding of the text may change if and when I learn French and read the text in its original language. This made clear to me why one may prefer some translations to others, and how specific terms may not only better depict a certain claim, but also perhaps historicize and contextualize these claims in ways that particular translations may not be able to communicate.

Overall, I found this assignment interesting, and it made me think about language as a technology or technique in and of itself: the non-neutrality of language, perhaps, as a way in which specific ways of knowing and understanding are brought to the forefront in ways inextricable from power, which is something that Sylvia Wynter has talked about before.

Word Soup! – Voyant’s Text Analysis Tool

I wanted to test out Voyant’s proficiency when it comes to using a text with multiple languages. To do this, I inserted various texts into the software: English, Spanish, and two texts with a mixture of both. Was Voyant able to 1. distinguish between the two languages and 2. make connections between words and phrases in both English and Spanish?

I first used Red Hot Salsa, a bilingual poetry anthology edited by Lori Marie Carlson. The text is composed of English and Spanish words, adding authenticity to its picture of the Latin American experience in the United States. Voyant could not recognize, distinguish, or take note of the differences in word structure or phrases. The tool simply calculated the number of words used, the frequency with which they were used, and where in the text these words appeared. Another test consisted of a popular bilingual reggaeton song entitled Taki Taki, performed by DJ Snake, Ozuna, Cardi B, and Selena Gomez. The system was again able to capture the number of words and how frequently they appeared. Yet it measured the connection between words through word proximity, and in a song that repeats the same words and phrases, this measurement is not clear.
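To see why proximity says so little about a repetitive song, here is a small sketch of counting which words fall near each other. It is an illustration only and not Voyant’s actual implementation: the function name `cooccurrences`, the window size, and the sample line (a made-up chorus, not the real lyrics) are assumptions.

```python
import re
from collections import Counter

def cooccurrences(text, window=3):
    """Count unordered pairs of different words appearing within `window` tokens."""
    # Match runs of letters only (accented letters included), ignoring digits.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Made-up repetitive chorus used purely as a stand-in.
chorus = "baila baila baila conmigo baila baila baila conmigo"
pairs = cooccurrences(chorus, window=2)
print(pairs)
```

Every link the counter finds here is the same pair of words, so the resulting “network” only restates the repetition. Nothing in the counting knows which words are English and which are Spanish, either.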

Finally, I decided on an old English text, one of my favorite poems: Sweetest Love, I do not Go by John Donne. Here I looked at the links tool and noticed the connection between the words die, sun, alive, and parted. The tool gave me a visual representation of metaphors inside the poem (just because we are apart, we won’t die; like the sun, I will come again, alive). I found the links section the most useful part of Voyant.

While exploring this tool, I recalled Cameron Blevins’s experience with text mining and topic modeling (Digital History’s Perpetual Future Tense). As with most of these digital apparatuses, one must go in with a clear intention, and with some background on the text, before analyzing it. Without this, the quantitative measures will be there, but they will not have much meaning. They will become just Word Soup!

“…using computation, and reframing the scale of literary inquiry, are two distinct things.”

Given all of the possibilities digital text creates for address (Witmore), I really appreciated the pieces from this week that trouble text analysis by questioning the over-reliance on computational methods to conduct it. 

By re-centering the physical book and its susceptibility, as it regards analysis, Witmore and the essence of the Underwood quotation for the title of this post inspired me to think about the conversation both readings provoke about the limitations of text analysis (as it stands) and why it matters to society at large.

When I think about the salience of distant reading and lemmatization (or categorizing text) as a means to digital text analysis, I am reminded not only of the analytical flaws that are probable in research reliant on it, but also that the aggregated processing of information feels predetermined in our data-world. It is not enough to point out the cultural nuances lost in analysis; we must also note that the surveillance- and capitalist-driven attitudes toward a rapid, digitized world were widely accepted as an objective route to information and readership alike.

I can recall very few public discourses that speak to the value of the manuscript such as the one that emerged when Amazon’s Kindle became popular in the late 2000s. While much of it spoke to profit margins, modernity, and access (versus institutional research and DH), a handful of arguments advocated for the physicality of books and the analytical processes unique to them that computational analysis is missing.

In advocating for a better addressability in text analysis to yield the qualities associated with analyzing the manuscript, Witmore insists that a phenomenological shift would have to occur: 

“We need a phenomenology of these acts, one that would allow us to link quantitative work on a culture’s “built environment” of words to the kinesthetic and imaginative dimensions of life at a given moment.”

Much like Drucker (2014), there is a cultural address that should precede and become valued by non/DH-ers that concerns the way we analyze textual information as a society — past and present— and that begins with our foundational relationship to text.

Text Analysis, Structural Power, and Structural Inequalities

Lauren Klein’s “Distant Reading after Moretti” offers a number of points of departure relating to the humanities as a whole, particular disciplines, and the general nature of computation. As an expanded version of our reading “Gender and Cultural Analytics: Finding or Making Stereotypes?”, Klein references Laura Mandell’s revelatory presentation in 2016 at the University of Michigan Library. In the presentation, Mandell expands her analysis of gender and stereotypes to include discussions about Google and various OCR efforts. Her exposures of major biases that completely distort interpretations and studies related to gender, and which occur at all levels of textual research and analysis from the problems of optical character recognition to the misuse of statistical techniques, argue for the importance of carving out an entire subfield or ongoing set of research initiatives dedicated to the critique of computation, along the lines of critical computation studies.

Not all schools of thought within the humanities may advocate for “connect[ing] the project of distant reading to the project of structural critique” or for actively supporting demands for social justice prompted by institutionally sanctioned practices of abuse and dehumanization within academic organizations. But given an academic landscape characterized by humanistic pluralism, scholars such as Klein and Mandell point to the possibilities of building forceful and lasting foundations of rigorous and critical scholarship for academic communities committed to socially engaged and progressive values, foundations which can serve as the leading edge in the project of exposing and interrogating power.

As one example of how the debates about representativeness, statistical significance, and bias in textual corpora can help methodological critiques in other disciplines, the ways in which historians generalize with broad brush strokes using terms such as “everyone” or “no one” in relation to an entire polity or culture appear less defensible given the construction of albeit problematic digital archives totaling potentially billions of texts and artifacts. Important and crucial skepticism about the archiving and analysis of texts leads to important skepticism about generalizations and abstractions, whether theoretical or empirical, quantitative or qualitative.

In terms of the nature of computation in general, questions that come up include: Is there hidden performativity behind the act of enumeration that gives quantitative analysis the chimera of ideological prestige? To what extent, if any, do the socially constructed dichotomies between computational work and the traditional work in the humanities (or between the digital and the analog, etc.) reflect the functions of class, gender, race, and education within the context of private capital accumulation? Ted Underwood and Richard Jean So underscore the value of experimental, iterative, and error-correcting models and methodologies in computational research. To what extent would a commitment to these approaches address the issues Mandell raises about the problems of text as data regardless of computational techniques?

Distant Reading, Race, and Gender

Topics top of mind from this week’s readings:

distant reading. 

historical context and perspective.

big data.

text analysis.

gender and text analysis.

race and text analysis.

sexual harassment and misappropriation of power.

the gender and race docs were the most interesting to me this week; both strongly express that even distant reading can produce data that can be misread to the advantage of proving a prior theory. what should be undertaken (according to these authors), in the course of conducting this research, is an examination of how certain collections have come to be, and how they may represent falsely distinct collections. using the examples of “white” and “black” race or “male” and “female” gender, we can see that quite a lot is missing from these attributes. not only the effect of identity intersectionalities, but also the definition of what these terms mean in a cultural context from time period “x” to time period “y”.

I thought the gender piece very clearly presented the case for more flexible gender classification and the case against a binary gender mindset and against the conflation of gender and sex. the analogy of gender to genre was particularly successful as an aid. Likewise the race piece argued that while it’s possible to find clear commonality among white authors and among black authors, the numbers on closer inspection show the very slight percentage that makes up the difference, and the reason for the separation in texts’ indicators is not quite as clear as it seems when reading the high-level findings of the data.

distant reading and text analysis can provide such a wealth of insights, and with the expansion of computational power the possibilities for scholarly discovery are immense. but this area of research going forward will be best served by attention to context, history, and representation in order to provide information that is as holistic and evenly considered as possible.

Praxis: Text Mining

Earlier this week, I found myself thinking about pre-digital text mining a lot. Of course, the sorts of analysis we use tools like Google Books Ngram Viewer, Voyant, and JSTOR Labs Text Analyzer for have been in existence for far longer than the Internet, albeit in different forms. Indeed, in my past, I’ve used much less advanced means to “mine” texts: during my junior and senior years of college, I spent quite a bit of time doing just that with Dante’s Divine Comedy, comparing different sections of it to one another, as well as to other texts. I’ve been told about and recommended Ngram and Voyant on many, many occasions in the past, but I’ve never had an excuse to explore them deeper. This project gave me the chance to explore not only those two, but also JSTOR Labs Text Analyzer.

I decided to perform a different experiment with each tool. While Voyant was described as the “easiest” of the tools, at first glance, Ngram drew my attention most due to the fact that it works off a predetermined corpus. This made me think about how reliable it could really be. At best, if one views this corpus as a “standard” of sorts, it could be used to compare multiple studies that used Ngram. However, I find myself wondering if information gained through Ngram could be potentially biased or skewed compared to information gained through other text mining tools that let you establish your own corpus.

In any case my first experiment wasn’t entirely in line with the project prompt, but I was curious, and I wanted to start somewhere. After reviewing Ngram’s info page, I decided to enter the names of a collection of 9 beings from different folklores and use the longest possible time range available to me (1500-2019). I made sure not to pick anything like “giant” due to the fact that “giant” is a word we use regularly with no attention paid to the associated creature. For consistency’s sake, I also didn’t pick any beings that are explicitly unique, such as gods.

It gets a bit cluttered at points, but I think it presents some interesting results, especially about Ngram’s corpus. It’s immediately clear that the basilisk has the highest peak and overall mentions of any being in this experiment, and even in our time, is only surpassed by two other beings. We also have a pretty clear second place, with the gorgon, who is just below the basilisk, even today. With regards to today though, the golem and djinn went from being fairly obscure to being most and second most popular.

With regard to the former two, the basilisk and gorgon, this may indicate a very strong bias in Ngram’s corpus in favor of Cantabrian-Roman (basilisk) and Greek (gorgon), that is, Western, pre-20th-century mythological texts. Alternatively, it may indicate that writers during those times were using those words more often in figurative language and metaphor (this may be completely arbitrary), or that more editions of myths regarding those creatures came out during those times. With regard to the latter two, the golem and djinn, respectively from Hebrew and Arabic folklore, they have received much more media attention as time’s gone on due to the growth and popularity of the fantasy genre.

The other 5 beings, from most to least frequent in mentions, are respectively from North American, Japanese, Oceanic, Germanic, and Orcadian folklore. The Tanuki has its first uptick in popularity around 1900, or just before the death of Lafcadio Hearn, responsible for some of the first texts on Japanese culture available to a mainstream English-speaking audience. Additionally, while the wolpertinger (Germanic) and nuckelavee (Orcadian) are Western myths, they’re the two least popular, which doesn’t coincide with the apparent bias toward Western mythology. Perhaps it’s specifically a Greco-Roman bias? I would probably have to conduct a more in-depth study.

Moving on, I tried Voyant next. I was excited to upload documents and even experiment with my own writing, but I was surprised to see that I could upload images and even audio files. The first trial I gave it was trying to get it to upload the graph I posted a bit ago, and then a MIDI and an .m4a file I found in my downloads folder. I was saddened to see this:

But I pushed onward and found that Voyant certainly tried to mine the audio file I fed it. Of course, it spat it out as a bunch of garbled text, but I might make a foray into the realm of trying to get practical use out of getting a text miner to mine images and audio in the future. For the time being though, I decided to stick with something more conventional. I tossed it a .txt file with a single haiku on it that I had lying around for a proper test run, and then got to my real experiment. I uploaded a folder of between 40 and 60 Word documents containing poems (certain files contained more than one poem) I wrote since 2013 just to see what would happen, and I got the following lovely word map:

After doing so, I realized there wasn’t a lot of data to glean from this that wasn’t self-indulgent, and at points, personal. Apparently though, I really liked the word “dead” (note: there WAS a poem in there about wordplay regarding the word “dead.” it’s a lot more lighthearted than you’d think). My next, and for the time, final usage of Voyant involved A Pickle for the Knowing Ones by Timothy Dexter: using the “Split Pickle” view of the piece offered by this site, I compared the original text of the piece’s first folio to the site’s “translated” version. For those unaware, Dexter dropped out of school at a young age and was largely illiterate.

Note: “pickle” refers to the original text, whereas “pickle 2” refers to the “translated” version.

Of note, in the untranslated version, the average sentence length is 333.6 words. With this corpus, something of note is that all of the “distinctive words” in the original text are… not really words. The fact that the original has 8 instances of “leged” while the translated version has 8 instances of “legged” is a testament to this.
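A figure like 333.6 words per sentence makes more sense once you consider how a tool might decide where a sentence ends. The sketch below assumes sentences are split on periods, exclamation points, and question marks; I don’t know that Voyant does exactly this, and the function name and sample strings are made up for illustration.

```python
import re

def average_sentence_length(text):
    """Average words per sentence, splitting only on ., !, and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)

# Made-up samples: the same words with and without sentence-ending marks.
punctuated = "I am the first. I am the last. So it goes."
unpunctuated = "I am the first I am the last so it goes"

print(average_sentence_length(punctuated))
print(average_sentence_length(unpunctuated))
```

With no sentence-ending marks, the whole passage counts as one sentence, so the average grows with the length of the text. If the original has very little punctuation, as the 333.6 figure suggests, the number reflects that rather than anything about Dexter’s sentence structure.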

Finally, I experimented with JSTOR Labs Text Analyzer. Here, I decided to go with an essay I wrote on the overlap between architecture and the philosophy of architecture, traditional narrative, and video game narrative during my senior year of college, and I was pleasantly surprised to the point of being slightly unnerved with the result due to its accuracy.

“Mathematical objects” absolutely includes architecture in the context of the essay.

There was a little less to experiment with on JSTOR, so I mostly left it at that. The number of related texts I was offered was certainly appealing though, and out of the three tools I experimented with, I think this text analyzer would be the one I would use for research, at the very least.

Biases in Distant Reading

In Richard Jean So and Edwin Roland’s essay “Race and Distant Reading”, they define distant reading as a process used to “describe the use of quantitative method to study large, digitized corpora of texts”. Basically, the practice of this method is to analyze a large number of texts through a digital system in order to find common textual patterns. The term was coined by literary historian Franco Moretti and has since been debated and critiqued in the field of DH.

The critiques in question have a common thread in the readings we had this week and honestly it’s not something I was exactly shocked to hear about. Biases against race and gender have long been an issue in the literary world. The nuances of both factors are often neglected and not accounted for and distant reading fully showcases that. With distant reading we don’t get the close attention to detail that these components fall under. Lauren F. Klein’s essay “Distant Reading after Moretti” explains that this seems to be a problem of scale- “they require an increased attention to, rather than a passing over, of the subject positions that are too easily (if at times unwittingly) occluded when taking a distant view” and this is where the problem arises. With a “passing over” way of analyzing texts we are left with clichés and stereotypes based off of assumptions.

Not only can these assumptions cause inaccuracies in results, but in today’s society this way of thinking and organizing is just not plausible. Labeling things to fit the criteria of a certain race or gender is nearly impossible considering that the social construction of both is always changing. Gender is no longer just an M/F category and thus should not be viewed as such. In her essay “Gender and Cultural Analytics: Finding or Making Stereotypes?”, Laura Mandell references professor Donna Haraway and sums up how gender should be viewed in distant reading. One line that really stuck out to me is when she says that gender in writing should be defined as a “category in the making” … “as a set of conventions for self-representations that are negotiated and manipulated”. Similarly, in “Race and Distant Reading”, So and Roland state that “the racial ontology of an author is not stable; what it means to be white or black changes over time and place”; they use author Nella Larsen as an example, where today most scholars will identify her as black while in the 1920s she was referred to as mulatta.

Furthermore it goes without saying that race and gender should absolutely not be the only signifier to allude to a writer in analyzing their identity. There are a number of elements that go into forming a person’s identity and such elements should be taken notice of. Klein suggests that exposing these injustices will help make the practice more inclusive, something I really hope takes off because in my opinion the concept of distant reading seems like something that can be of great use in the world of research.

Text Mining through HathiTrust

Having just taken part in the HathiTrust digital library workshop I chose to use HathiTrust and their analytical engine to do this text mining assignment. The first step was to find a collection of texts to analyze. Thankfully, HathiTrust has a tab specifically tailored to finding either ready made collections or forming new ones. I chose to use a collection of texts which used the word “detective” in published works prior to 1950. There were only 61 volumes in this particular collection, so it was a good place to start my experiment. It is worth mentioning that should you need to, HathiTrust allows you to make your own collections based on terms you search up. Using this collection I downloaded a TSV file which held all of the volumes in tabs according to the frequency of the word “detective”. HathiTrust’s analytics was located on a different site, where you can upload your TSV file and run algorithms to visualize a topic model of the term used, themes and frequency. Using the InPho Topic Model Explorer I was able to create an interactive topic bubble model. Attached is the link and screenshot of what the map looks like.


The algorithm groups clusters and color codes them automatically according to groups of topics and similar themes. If you follow the link you will see, for example, that the orange clusters are characterized by fictional stories, and mysteries in which the word “detective” is frequently used. If you use your mouse to hover over each bubble you will see these commonalities mentioned quite often. Also, it is easier to discern similar themes and topics if collision detection is turned off. The above image of the topic model is of the model with collision detection turned on. Below is an image of the topic model with collision detection turned off.

Hovering over the cluster with collision detection turned off makes it easier to see what themes the color coded clusters share.

Being curious about the trends of the terms “detective” and “crime”, I then utilized HathiTrust’s Bookworm application to further this experiment. The Bookworm search does not use a specific collection; rather, it searches HathiTrust’s entire digital library to display a simple double line graph of the usage of terms over time. The end result is a clear visualization of the trends in term usage over a period of 250 years.

From this graph the most significant takeaway is the correlation of the uptick in usage of the terms “detective” and “crime” with the rising popularity of detective novels and media from the 1920s on. Prior to 1920, as is confirmed by the InPho Topic Model, the term “detective” was used in mostly a non-fictional or informational capacity.

A downside to both Bookworm and the InPho Topic Model Explorer is that the texts themselves are not listed. To access the texts used, one would either have to record every volume in a collection or look closer into the TSV file itself, as far as I know. Since I attended the workshop, finding my way around HathiTrust was not too difficult. However, these tools are limited by the scope of information available through HathiTrust’s digital library. Although it is partnered with several institutions, including the Grad Center, you may not have much luck mining more obscure terminology. Such terms include those with ethnic connotations or terms not consistent with Western academia, like terms with regional specificity.

After M*****i…

As others have mentioned in class, I also generally approach the readings for the week in the order they are presented in our course schedule. And I was excited to be learning more about distant reading: how it’s not necessarily an argument against close reading, seeing similar arguments about acknowledging “data as capta” and that there’s no such thing as an objective distant reading, but there’s room for more complexity and nuance. And of course lots of mentions of Franco Moretti, who originally coined the term.

Then I get to Lauren F. Klein’s “Distant Reading after Moretti.” Moretti has been accused of sexual harassment and assault, the details of which became more broadly known during #MeToo. After briefly acknowledging this, Klein goes on to discuss the ways in which a lack of representation in the field contributes to certain distant reading practitioners reinforcing problematic power structures. Klein says, “And like literary world systems, or ‘the great unread,’ the problems associated with these concepts, like sexism or racism, are also problems of scale, but they require an increased attention to, rather than a passing over, of the subject positions that are too easily (if at times unwittingly) occluded when taking a distant view.”

And then I get to two readings, both written after the allegations, that engage with Moretti and his contributions to the field: Laura Mandell’s “Gender and Cultural Analytics: Finding or Making Stereotypes?” and Richard Jean So and Edwin Roland’s “Race and Distant Reading.” My first thoughts are, can we separate the work from the person who performed it? (I’m not sure, but I’m skeptical.) And then what is the responsibility of a scholar to be aware of such abuses by the people they reference? My first thought was perhaps (being relatively new to the field and unaware of different scholars’ relationships to each other and to the general happenings within DH), maybe they didn’t know. Or maybe this is another issue where the publication date isn’t actually representative of the chronology in which they were written. But both of these articles include Klein’s panel discussion in their works cited. Admittedly both articles are critical of Moretti, but those criticisms are separate from the sexual assault. Do they have to acknowledge this? Does not acknowledging it let Moretti retain his privileged position within the field as someone whom discourse on distant reading necessarily has to reference, and who frequently appears as the starting point from which criticism must begin? I definitely wouldn’t advocate for erasing him from the history of DH and/or distant reading, but I’m not sure just being critical of his work is enough when it’s clear the authors of the latter articles are also familiar with the rape and harassment allegations.

Is this something that only comes with more time (and even if that has been the case, is that the example we should follow)? Is there reputable discourse on the “founding fathers” of the United States that doesn’t address their owning and raping of slaves? (Now that I’ve asked this, I’m fearing the answer may indeed be “yes.”) Medicine certainly still has many skeletons in its closet to contend with, but there has been a movement to rename conditions previously named after Nazi doctors. (I’m not sure this is an example to be followed, but naming medical conditions after people is incredibly problematic and also not very helpful in understanding what a condition really is anyway, so that’s its own can of worms…)

I’m not sure exactly what repercussions Moretti has faced, though I fear few. I found this article from Stanford Politics about him and another professor.

A graph showing the frequency of the use of the words "robot" and "android" in the HathiTrust corpus.

Exploring with Words: Working with HathiTrust data workshop

Yesterday I had the pleasure to attend the workshop about Working with HathiTrust data led by Digital Fellow Param Ajmera. In case you missed it, you can find an article about HathiTrust on the Digital Fellows Blog, Tagging the Tower.

HathiTrust is a huge digital library that contains over 17 million volumes, and for this reason it’s particularly good for large-scale text analysis. The advantage of using HathiTrust, as Param showed us, is that you can perform the text analysis on the website of the HathiTrust Research Center – HTRC, a Cloud computing infrastructure that allows us to parse a large amount of text without crashing our computers. And the best part – it’s free and you can create an account with your CUNY login.

Param gave us a live tutorial on how to select the texts we want from HathiTrust and create a collection that we can save and open on HTRC. He created a collection of public papers of American Presidents and used Topic Modeling to find the words that are most commonly used together in these texts. The result was a list of “topics”, groups of words that the algorithm gathered according to how often they appear together. This allowed us to make a comparison between different presidents according to the keywords in their public papers. The experience was very interesting – and with the perfect timing!

The part I enjoyed the most, however, was when Param taught us how to use Bookworm, a tool that creates visualizations of language trends over the entire corpus of HathiTrust. The result is very similar to the Google Ngram Viewer, but Bookworm has one advantage: when you click on a point on the line, you can see a list of the texts where the word appears.

Since topic modeling can take a long time (hours or even days) depending on the volume of text you’re working with, I decided to experiment with Bookworm. Here are my Ngrams:

  1. Being a sci-fi lover, I decided to check the frequency of the words “robot” and “android”. I was initially surprised when I saw that “android”, compared to “robot”, had such a low curve. When I checked the texts that were used for the Ngram, I realized that the word “robot” appears in a lot of papers related to engineering, robotics, and information technology, while “android” is a term mostly used in sci-fi. If we look at the curve of “android” alone, we see that the word has a spike in the 1960s. Was it because of Philip K. Dick? Or Star Trek?
  2. Inspired by the Rocky Horror Picture Show and the TV show “Pose”, I decided to investigate the relative frequency of “transvestite”, “transsexual”, and “transgender” in the HathiTrust corpus. The first two terms sound pretty dated – and rightfully so, while the third one is the most commonly used now. As you can see from the graph, the use of the term “transgender” skyrockets starting in the mid-1980s, beating the other two at the end of the 1990s. Another thing I noticed is that, looking at the texts:
    1. “transvestite” is mostly used to describe a cultural phenomenon (for example in texts about literature, theater, cinema, or fashion)
    2. “transsexual” is used in medical contexts, for example in papers about gender dysphoria
    3. “transgender” is used in a medical context, beating “transsexual” at the end of the 1990s. However, it is also used in institutional contexts like policies, guidelines, and social justice reforms.