Analyzing Subjects and Intention (Text Analysis)

After reading the class assignments related to text analysis, I found the idea of analyzing large sets of text to be potentially exciting, inasmuch as esoteric subjects are dependably, vaguely thrilling at the outset. Learning about text analysis, and more specifically the practice called distant reading, with a mind to eventually imitate something like it, seemed to me like peering into an exclusive clubhouse (which I can see in my mind's eye as a cavernous hall furnished with ornate tapestries and filled with distinguished, singular types of people). How fortunate and how luxurious, to contemplate what vast amounts of characters might reveal once compiled and compared, much less to set about the task itself!

So I humbly undertook to use what was advertised as the simplest and easiest tool for the job, the platform called Voyant. My first idea was a bit grandiose, inspired as I was by the course readings. I wrote in my notebook for a bit about a trend I have noticed in many parts of my own life around the topic now called "DE&I," or diversity, equity, and inclusion.

I thought it might not be too difficult to pull some information down from the web for an analysis of a few key words and their percent change, if you will, from the 1950s and '60s (let's say) to the 2000s and 2010s. I thought I could focus this research on words used in children's books, as I had an idea that they might be more readily available than some other works (and more straightforward too).

With children's books in mind, I altered the premise somewhat from "DE&I" per se (which is rather adult) to words that carry out that strategy, such as "encourage," "together," "different," and "share."

But. It is not, in practice, any easier to access children's books as plain text than adult books, nor is it particularly easy to quickly comb through different kinds of these books to compare and contrast ones that will provide even something close to a coherent sampling for study.

So I changed tactics. Here we are in an election year, with a pattern of words basically doing our heads in. Shall we compare the mentions of key words on the front pages of major newspapers on any given day, and can we draw from that any insight into a publication's authors, audience, ownership, or point of view? What else could we see?

A few well-known papers:

On NYTimes.com: we can see that the focus in the text is almost the same on Trump vs. Biden, and on seats gained vs. lost.

Washingtonpost.com: very similar to NYtimes.com.

WSJ.com (the Wall Street Journal) provides a different view, with more emphasis on “election” in general.  

With the understanding, and the disclaimer, that it may be overly simplistic and not at all thorough to use the data visualizations alone as proof in this case, for the purpose of illustration I have pasted in the Voyant-generated word clouds for the sites mentioned. See below:

NYTimes

WashingtonPost.com

WSJ.com (Wall Street Journal)

A vote for a broader structural and institutional understanding of social media and disinformation.

Our sampling of readings related to social media and disinformation helps to crystallize a couple of observations: a significant amount of helpful empirical work is being undertaken (some of which argues that not enough data yet exists to warrant generalizations), and questions and research related to commercial advertising, media, and social media ownership point to useful lines of inquiry. However, without a broader contextualization within (albeit contested) conceptualizations of social institutions and socio-economic structures at the system level, covering the longue durée, empirical work risks putting the cart before the horse, or jumping prematurely to symptomatic correlation rather than comprehensive explanation. By placing empirical analysis and research findings regarding disinformation, misinformation, propaganda, and social media manipulation into the broader context of theories of capitalism, neoliberalism, liberal democracy, and mass media, we can avoid the pitfalls of incomplete research.

Perhaps the most fruitful analysis along these lines is the application by Jessica Ringrose of the logics of ‘aggrieved entitled masculinity’ within social media spaces. Building on Kimberlé Crenshaw’s theories of intersectionality, Ringrose points to the “relative degrees of privilege and oppression defined through access to structural power” as factors that help explain the support for racist, misogynistic, and rapist ideologies. Another approach is the attempt to develop propaganda models for mass media, such as the model put forward by Noam Chomsky, Edward Herman, and others. In parallel with empirical research, broader theories reveal how the problems of misinformation in social media are more deeply rooted than the content, the bad actors, and vulnerable communities; when populations establish polities based on a series of myths, the foundations themselves serve as breeding grounds not just for tyranny but for the propagation and acceptance of lies; depending on how we define actually existing democracy, its limited forms of self-determination may reflect deeper vulnerabilities that serve as drivers of some of the observations of empirical research. Explanations of disinformation in social media without problematizing broader frames of reference echo attempts to explain problems in K–12 education. Is it the teachers? Is it the school system? Is it the curriculum? Is it the family? Or is it all of the above, within the larger context of the political economy, the broader socio-economic imbalances enforced by regimes of capitalist-controlled markets, and obstacles to fundamental constitutional reform?

What is (a) text? Another attempt…

As a followup to our class discussion, I attempt another provisional proposal of text as:

“a preserved or remembered piece of human language”

(My apologies for the raw and unstructured bullet points.)

  • “Preserved” also encompasses a piece of human language that is merely rememberable or remembered.
  • Texts are reducible to symbols (as described by Bianca’s expansion of semiotics and signs) that are predominantly but not exclusively mediated linguistically, i.e. communication facilitated by the larynx, the tongue, and the inner ear, as opposed to predominantly mediated visually with the eyes, and which have a linguistic grammar (morphology and syntax).
  • Visual media (which is increasingly audiovisual), while sharing symbolic foundations with text, operate through a non-linguistic symbolic grammar and are not text.
  • While imagery may not be, in and of itself, text, it would seem that all symbols and signs maintain a context which carries the possibility for narrative and some degree of textuality.
  • Class discussion exists as a text as long as we remember the class as a piece of human (linguistic) language. Conversely, a piece of human (linguistic) language that is not remembered (and only momentarily interpreted) is not a “meaningful” text, but may be an “ephemeral” text. This is not to say that ephemerality is not meaningful. It is simply to say that the transitory nature of communication and attention assigns meaningfulness to a ratio of signal vs. noise.
  • If any piece of language is remembered in the unconscious, pieces of language in the unconscious may become “conscious” text when they re-enter consciousness.
  • Texts encompass:
    • Any linguistically constructed (and linguistically interpreted) symbols before the advent of oral literature.
    • Any oral literatures, such as Vedic chants and Homeric hymns before the advent of writing.
    • Akkadian cuneiform tablets of non-narrative numerical accounting.
    • Undeciphered scripts of pictograms, logograms, and ideograms such as perhaps the scripts of the Harappa and Mohenjo-daro cultures. As undeciphered scripts are deciphered the degree of textuality and linguistic interpretability increases.
  • The characteristic of being remembered (or preserved) bestows the possibility of some form of consideration however slight or however involved. It is this possibility of consideration, interpretation, or mental (or computerized) processing that brings a piece of (linguistic) language into being as a “meaningful” or “ephemeral” text.
  • Is a linguistic utterance momentarily a text during the instantaneous act of interpretation? The answer must be yes. Ephemeral texts are constantly and instantaneously coming into being and out of being as we communicate. What is “ephemeral” may at any moment become “meaningful” depending on the extent of consideration.
  • If human (linguistic) language is implicated in the existential mess (in relation to the untold worlds of unnecessary suffering and enforced trauma), is it not reasonable to assume that human (linguistic) language (especially as text) holds the potential to help us find a way out? Following Kevin’s remarks on the implications of text in erasures and oppressions, text and visual media carry the afterglow of the misuse of human tools, which are in and of themselves neither harmful nor harmless.
  • Visual Imagery not constructed or interpreted linguistically is not text.
  • To the extent that non-representational art may exist, non-representational art is not, in and of itself, text.

Are the following artifacts:

1. Text or not text (binary)?

2. Or do they maintain degrees of textuality (non-binary)?

(in the sense of either the representational recordings or the actual communication going on)

Can a specific set and hierarchy of symbols (of an empire) be considered a text?

Flags of UK and Its Colonies

Can either a logogram or an image of the Buddha be considered a text?

Buddha in Chinese and Image of Buddha Statue


Can the communication going on between ants be considered text?

Ants Communicating


Can the communication that goes on between a human and an animal be considered a text?

Communicating with a Dog

Text Mining through Harry Potter and the Sorcerer’s Stone

As a former English major, I've read and analyzed my fair share of texts. Everything from mid-century novels to Shakespearean plays pretty much encapsulated four years of my life. Although I appreciate the literary enrichment they have provided me, they did not entice my literary curiosity as much as the Harry Potter series, in particular the first book. I suppose I'm favoring sentimental value over content value when I make this statement, but much like the first praxis assignment, I wanted to work with something of genuine interest to me, and when I think of a text that does that, nothing comes to mind more readily than "Harry Potter and the Sorcerer's Stone". I could read this book over a hundred times and still feel like I'm being directly transported into a world of magic, a key term that I will end up exploring while text mining.

***Now, before I get into my findings, I want to put a disclaimer here about the author of this book.*** Unfortunately, in recent years J.K. Rowling has become known less for the books she created and more for her controversial and offhand remarks regarding trans individuals, in particular trans women. I do not agree with her way of thinking on this topic at all and find her thoughts on the matter appalling and unacceptable. However, in our last class we touched upon the idea of separating the art from the artist. After giving this much thought, I felt it was okay to go on using a Harry Potter book as the focus of my project, as I don't believe the legacy of these treasured tales should be sullied by the gross remarks of their author. With that being said, I apologize to anyone I may offend; it is not my intention to do so. I come in with completely innocent intentions.

In looking at the tools that were suggested to us, I decided to start things off easy and try my hand at using Voyant. The logistics of this tool are fairly simple and to the point. The homepage starts off with a box where you can input text or URLs, or upload a file. I already had a full e-book version of "Harry Potter and the Sorcerer's Stone" on my laptop that I downloaded from a website called passuneb.com, an e-learning platform that provides free educational resources to primary and secondary students. Although I am thankful for the easy accessibility and zero-dollar charge that came with the downloaded e-book, I was a bit irked by what I can only describe as watermarks on each page.

(These two were on every page)

The website's name, repeated on every page, ended up getting added to my generated corpus and mixed in with my results. I couldn't find a way to omit it from my results overall, but luckily I was able to take it out of my line graph showcasing document segments.

My cirrus word cloud visualizes www.passuneb.com in a larger font because the term appears 452 times; that's more than the names of most of the characters. In addition, the line graph that Voyant came up with also featured the website's name as well as the word "said," both of which I took out, as I didn't think they were relevant to what I wanted to see. Instead I wanted to focus on the central characters' names and how many times they appear in the novel. Voyant pointed out that the names Harry, Ron, and Hagrid pop up the most, with Harry having a total of 1,214, Ron 410, and Hagrid 336. From here I started playing around with the tool myself. I wanted to add Hermione to my document-segments graph, as she is a vital character in the novel. Her name comes up 258 times, putting her right behind Hagrid in terms of character names. Adding her to the graph was easy, as Voyant has a feature right underneath the graph where you can input the words you want to visualize. You can input multiple words or just leave it as one. Each character name was assigned a different color, with a key above the graph to show which color coincides with each name. Voyant also has a display feature that allows the user to add labels or change the style of the graph. For example, instead of a line graph one can use a bar graph; however, I felt the line graph was the clearest way to show the results.
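Under the hood, this kind of frequency counting with excluded terms amounts to a filtered word count. A minimal Python sketch of the idea, where the sample text and the stopword set (including the watermark) are made up for illustration:

```python
import re
from collections import Counter

# Hypothetical stand-in for the e-book text; in practice this would be
# the full plain text of the novel, watermark and all.
text = """Harry said hello. www.passuneb.com
Ron and Hagrid waved. www.passuneb.com Harry grinned."""

# Tokens to exclude: the site's watermark plus other noise words.
stopwords = {"www.passuneb.com", "said", "and", "the"}

# Tokenize, keeping dots inside tokens so the URL survives as one token,
# then strip trailing sentence punctuation.
tokens = [t.strip(".,!?").lower() for t in re.findall(r"[\w.']+", text)]
counts = Counter(t for t in tokens if t not in stopwords)

print(counts.most_common(3))
```

This is roughly what editing Voyant's stopword list accomplishes: the watermark and filler words simply never reach the word cloud or the line graph.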

After looking into the character names, I was interested to see what would come up if I looked at how many times a particular word or phrase appeared in the novel. The words I chose were magic and magical; it only makes sense for a book that's based on the existence of magic. In my findings, the word magic came up 48 times while magical came up 11. I was a bit shocked that the results were lower than expected, but perhaps that's my mind playing tricks on me, as I must have convinced myself that the words came up more often than they actually did each time I read the book. I guess this is why tools like this are so important when it comes to research; the human mind is not one hundred percent accurate at all times.
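One subtlety in counting "magic" separately from "magical" is that a naive substring search would count every "magical" as a "magic" too; word boundaries avoid that. A small sketch (the sample sentence is invented):

```python
import re

text = "Magic was everywhere; the magical castle hummed with magic."

# \b word boundaries ensure "magic" does not also match inside "magical".
magic = len(re.findall(r"\bmagic\b", text, re.I))
magical = len(re.findall(r"\bmagical\b", text, re.I))

print(magic, magical)  # prints: 2 1
```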

Nonetheless, I still wanted to experiment more with these words, so I decided to shift my attention to Google Ngram. In the search bar I input the words magic and magical and narrowed the year range to 1990–2000. The words saw an increase starting from 1997 toward 2000; "Harry Potter and the Sorcerer's Stone" was first published in June of 1997. I'd like to believe the introduction of this enchanting series jumpstarted the increase.

In conclusion, I have mixed emotions about this first step I have taken into text mining. It was fascinating, to say the least, to be presented with these findings in less than a minute, but I feel the tools still have a few flaws. In Voyant's case, I fully understand why it chose to put the website's name in the results, as it is featured in the text and appears often. Voyant did its job and emphasized it in its findings. However, for aesthetic purposes I wanted to visualize only specific things in my results, i.e., the characters' names and the words magic/magical, and I wish the website weren't so spotlighted. If I'm missing something and there is a way to take a word out of the cirrus, then please let me know in the comments. As for Google Ngram, I felt the tool was easy enough to use, but I was a bit disappointed with the lack of information provided. In other words, I wish there were more to play around with on the site, perhaps features that allow you to change the physical appearance of the graph other than the smoothing tool. Complaints aside, this exercise has definitely opened me up to a world of research that I have not had the chance to experience before. I look forward to working with these tools some more in the future.

Embroidery showing stick-figure Buffy with crossbow and arrow that reads "Buffy will patrol tonight"

In Every Generation There Is a Chosen Text Mining Tool

From the moment the text analysis assignment was mentioned, I knew I wanted to do something with transcripts from the TV series Buffy the Vampire Slayer (seven seasons airing from March 10, 1997, to May 20, 2003). I had no idea what I wanted to examine or any question I might want to answer, but it’s a show I love and have seen many times, and I figured it would just be fun.

Text Gathering

I decided to do a “bag of words” comparison in Voyant using the Cirrus word cloud tool, and I originally expected to use each of the seven season finale episodes as my texts, assuming that a season finale encapsulates the overall themes of the entire season and that I might be able to identify some kind of series-wide arc. But some finales were two-parters, so the amount of text being compared across seasons wouldn’t be the same, which I thought could be a problem. Inspired by our class conversation last week as we tried to answer the question “What is text?,” I realized there are several episodes of BTVS that play with some of the concepts we were discussing. So I decided to play around with the transcripts from select episodes instead to see what I might learn. I had wanted to take all of my transcripts from the same source, but this proved problematic, as not every archive had complete transcripts for every episode. For the first two episodes I chose, I got the transcribed text from angelfire.com. For the latter two, I got the transcribed text from transcripts.foreverdreaming.org. None of these are official transcripts.

I initially copied the text for each episode into a Word document, as I wanted to make sure there wasn’t any hidden metadata creeping into the text that might distort my visualizations/analyses. Some of the transcripts included stuff like “ACT I” or “Commercial Break,” both of which I removed. I was initially worried about the scene/stage directions being highly subjective, as they were written by different fans (and not taken from the original scripts used in production), and also about the “Name: dialogue” format for the lines. But I figured most of this type of text also exists in other narratives like books, when the author is setting scenes and defining who is speaking. However, when I put in the text for each episode, all of my word clouds were primarily just the names of the main characters and the most prominent secondary characters in each episode, which didn’t really seem very interesting in terms of analysis potential. So I then went back into each of the transcripts and removed the names of the four main characters (Buffy, Willow, Xander, and Giles) as well as any of the other characters who are prominent in the series or were prominent in that episode (e.g., Ms. Calendar, Angel, Spike, Anya, Oz, Tara, Dawn, Riley, Wesley, Jonathan, etc.). I also expanded the word clouds to include the top 155 words from each episode.
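The cleanup steps described above, dropping markers like "ACT I" and "Commercial Break," stripping the "Name:" speaker prefixes, and excluding character names from the counts, could be sketched in Python along these lines (the transcript snippet and name list below are invented stand-ins, not lines from any actual episode):

```python
import re
from collections import Counter

# Hypothetical transcript fragment in the fan-transcript format.
transcript = """ACT I
Buffy: We have work to do.
Giles: (dryly) The computer is not a book.
Commercial Break
Willow: The demon is in the computer."""

# Structural markers to drop entirely, and names to exclude from counts.
markers = {"act i", "commercial break"}
names = {"buffy", "willow", "xander", "giles"}

cleaned = []
for line in transcript.splitlines():
    if line.strip().lower() in markers:
        continue  # skip act/break markers
    line = re.sub(r"^\w+:\s*", "", line)  # drop the "Name:" prefix
    cleaned.append(line)

tokens = [t.lower() for t in re.findall(r"[a-z']+", " ".join(cleaned), re.I)]
counts = Counter(t for t in tokens if t not in names)

print(counts.most_common(5))
```

What survives is the dialogue and stage-direction vocabulary itself, which is what ends up feeding the word cloud once the character names stop dominating it.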

I Robot, You Jane

The first episode I chose is “I Robot, You Jane” (season 1, episode 8) in which a demon, Moloch the Corrupter, that had been imprisoned inside a book is released into the internet when the book is scanned as part of a digital archiving initiative at the school. Rupert Giles, the school librarian, gets into an argument with the computer science teacher Ms. Calendar, who is leading the initiative, at the beginning of the project:

Ms. Calendar: Oh, I know, our ways are strange to you, but soon you will join us in the 20th century. With three whole years to spare! (grins)

Giles: (smugly) Ms. Calendar, I’m sure your computer science class is fascinating, but I happen to believe that one can survive in modern society without being a slave to the, um, idiot box.

Ms. Calendar: (annoyed) That’s TV. The idiot box is TV. This (indicates a computer) is the good box!

Giles: I still prefer a good book.

Fritz: (self-righteously) The printed page is obsolete. (stands up) Information isn’t bound up anymore. It’s an entity. The only reality is virtual. If you’re not jacked in, you’re not alive. (grabs his books and leaves)

Ms. Calendar: Thank you, Fritz, for making us all sound like crazy people. (to Giles) Fritz, Fritz comes on a little strong, but he does have a point. You know, for the last two years more e-mail was sent than regular mail.

Giles: Oh…

Ms. Calendar: More digitized information went across phone lines than conversation.

Giles: That is a fact that I regard with genuine horror.

http://www.angelfire.com/ny4/amai/Buffy/s1ep8.html

Several scenes later, their argument continues:

Ms. Calendar: (exasperated) You’re a snob!

Giles: (incredulous) I am no such thing.

Ms. Calendar: Oh, you are a big snob. You, you think that knowledge should be kept in these carefully guarded repositories where only a handful of white guys can get at it.

Giles: Nonsense! I simply don’t adhere to a, a knee-jerk assumption that because something is new, it’s better.

Ms. Calendar: This isn’t a fad, Rupert! We are creating a new society here.

Giles: A society in which human interaction is all but obsolete? In which people can be completely manipulated by technology, well, well… Thank you, I’ll pass.

http://www.angelfire.com/ny4/amai/Buffy/s1ep8.html

The episode aired in 1997, and Giles’s character is generally portrayed as a technophobe throughout the series. His latter argument against technological innovation as good only because of its newness and against technology being necessarily the direction of “progress” reminds me of Johanna Drucker’s “Pixel Dust: Illusions of Innovation in Scholarly Publishing.” And Ms. Calendar is clearly trying to promote technology and digital archives as means to open access and inclusion. Does any of this come through in the word cloud? I think a little bit, with computer and book coming through as two of the highest used words (at 39 and 37, respectively). Cut is the most used word at 93, but I’m fairly sure that is because of stage directions.

Voyant Cirrus word cloud for BTVS season 1, episode 8.

Earshot

The second episode I chose is “Earshot” (season 3, episode 18). In this episode Buffy is fighting some demons, per usual, but something goes wrong when some bodily fluid from a demon is absorbed through her hand. She is infected with “an aspect of the demon,” which turns out to be an ability to hear other people’s thoughts. At first this is exciting as Buffy hears what other people are thinking of her and finds out interesting secrets people are keeping; however, the power grows and grows to the point at which it is driving her mad because she is hearing everything from everyone and cannot distinguish any of it. I thought of this episode when we were debating in class whether everything could be text and an example given was everything that people say, whether or not it is recorded or preserved. As we touched on text being understood through symbols, I thought words that we see only in our minds but do not share, either written or aloud, could also be text. This episode also felt like an apt analogy to what it might feel like for a human mind to analyze text on the same scale as a computer. What does this word cloud show? Looks, look, and looking are all very prominent, as are thoughts and demon. Cut again is very high, but this again I think is attributable to the stage directions.

Voyant Cirrus word cloud for BTVS season 3, episode 18.

Hush

The third episode I chose is “Hush” (season 4, episode 10). In this episode, the town of Sunnydale becomes the setting for a grisly fairy tale, in which the Gentlemen come in and silence everyone so that nobody can scream as they harvest the requisite number of hearts to terrorize humanity another day. So here we have an episode (and text) that is essentially quiet, with communication limited to succinct phrases and crude pictures written or drawn on portable whiteboards, and to the nonverbal: body language and pantomime. Are these forms of communication text? I’m inclined to say yes, as ultimately an entire storyline is conveyed to the audience just as it is in any other episode of the show. Given that this episode has the least amount of dialogue and relies the most on stage directions, which were written by a fan and are not from the original script, I’m guessing this word cloud says more about the word preferences of the specific person who wrote this transcript than any of the other word clouds do (though likely the original script would do the same for its writers, and now I’m wondering whether my assumption of a single author is even remotely accurate, as each episode and the series as a whole would have had multiple, and at times different, authors). Gentlemen/Gentleman are both relatively dominant in this word cloud, as are lackey/lackeys (who assist the Gentlemen). Looks is also high up there again, and picture and talk seem to be about the same size, though I think talk is present here because of its absence, especially as can’t is also very prominent (“Can’t even talk/can’t even cry” is part of the Gentlemen’s grim fairy tale rhyme).

Voyant Cirrus word cloud for BTVS season 4, episode 10.

Once More, with Feeling

The last episode I chose is “Once More, with Feeling” (season 6, episode 7). The musical episode! This time all of Sunnydale is under the spell of a demon who forces everyone to communicate through song and dance. It’s a feast for the eyes and the ears. Someone in class had discussed sheet music as text, and certainly song lyrics are text. (Would choreography also count as text?) Going off of that, then, this episode would contain two texts (three?) actually, the music as well as the lyrics. How do these texts “speak” with and inform one another? How is the message conveyed differently? I’m not sure this word cloud can really measure that as it only represents the lyrical text and not the musical text. Sweet is by far the most used word (65); looks and looking feature prominently again (are these stage directions or spoken?). Song, singing, and dancing are also high up there, and I can see other music-related terms like musical, refrain, sing, sings, dances, music, rock. I’m happy to see that bunnies also makes the cut for this cloud (because “It must be BUNNIES!”).

Voyant Cirrus word cloud for BTVS season 6, episode 7.

Future Text Mining

What did I learn here? I’m not entirely sure. Also, I had all of my Voyant tabs open for a while, and I noticed that the layout of the word clouds kept shifting (like the stairs at Hogwarts). In checking my word cloud links, the displays have even changed from the versions I took screenshots of (WHY?). This makes me even more confused about how Voyant determines the layout of this particular visualization and how useful it is to compare different word clouds against each other. Going forward, I’d be curious to know how these clouds might look different if I were working from the finalized scripts used to produce each episode. I’m also curious to dig a little deeper into the difference between stage directions and spoken dialogue. Do we need to distinguish them? It seems like they both ultimately work together to tell the story of each episode. Are there visual cues in the final episodes that are not represented in the stage directions? What does that say about translating text into images and vice versa?

Little Women, Little Surprises

As I mentioned on Wednesday, I went into this project thinking I had a hypothesis in mind and was determined to make a discovery. But I ended up spending most of this project just exploring the functionality of both Voyant and Google Ngram, and wasn’t able to make any monumental revelations. I was even wracking my brain to come up with sample text that would reveal something, but struggled to think of anything specific. I ended up browsing the public domain and pulled up Louisa May Alcott’s Little Women, just to get started with something. But exploring is good – and I enjoyed getting familiar with these text analysis tools.

Voyant is easy to use and gives you lots of tools to click through and try out. It’s also very visually appealing. I pasted in the text of the full novel and first searched within the Trends and Bubblelines windows to see how often each of the sisters is mentioned throughout the course of the novel. The results are clear to follow and not too surprising (see the number of mentions of Beth and Meg decline after their death and marriage, respectively – sorry, spoilers). I did find that two of the colors were a little too close in shade, and I didn’t figure out how to change them to get a little more contrast. Next I wanted to run the equivalent searches for the major male characters of the novel. Laurie was already one of the most common terms, so he popped up in the top-ten drop-down to select. But I had to stop and think about how to search for Friedrich Bhaer and John Brooke. I looked up both first and last names to find the most frequent name for each character, and ended up going with “Bhaer” and “John”. Again, clean narrative lines are the end result. I did like being able to reference the content in the Reader window by clicking on one of the points in the Trends chart and seeing it take me right to the start of that content segment. I also found it helpful to hover over an individual word in Reader to see its frequency, and then click to have that term override the Terms window and show that same frequency trend line over the duration of the novel.
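At its core, a Trends chart of this kind splits the text into equal segments and counts a term in each. A rough sketch of the idea, using a made-up stand-in text (the full novel is not reproduced here, and the repetition is artificial so the "decline" is visible in miniature):

```python
import re
from collections import Counter

# Invented stand-in for the novel's plain text; in practice, load the
# public-domain text of Little Women instead.
novel = ("Jo laughed. Beth played. Meg sewed. " * 5 +
         "Jo wrote. Laurie visited. Amy painted. " * 5)

def trend(text, term, segments=10):
    """Split text into equal word segments and count a term in each,
    mirroring what a per-segment trend line plots."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    size = max(1, len(words) // segments)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [Counter(chunk)[term.lower()] for chunk in chunks[:segments]]

print(trend(novel, "Jo"))    # steady across all segments
print(trend(novel, "Beth"))  # drops to zero in the later segments
```

Plotting those per-segment counts for each character, one colored line apiece, reproduces the shape of the Trends view.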

Finally, I explored the Links feature to see common relationships between words in the text. For obvious reasons, I chose to look at the link between Jo and Laurie. It’s really entertaining to watch the word bubbles hover around between the connecting lines. Trends seems to be the default target for most clicks, as clicking on a link line immediately creates a new Trends line there. I discovered this by accident and had to redo my previous search to go back.

Voyant really does all the heavy lifting for you, and there’s zero insight into how it operates behind the scenes. For quick, easy-to-visualize results, Voyant does a great job. Looking specifically at a novel, Voyant was useful for tracing narrative connections; I could see it being some kind of add-on to SparkNotes for readers looking to dig deeper into content. But overall, I think I was a little disappointed with the tool’s limitations.

Next I plugged “Little Women” into Google’s Ngram to see frequency trends of the novel title over time. Similar to my work in Voyant, I wasn’t too surprised with the results but had fun using the tool.

The frequency count begins to increase after 1864, continuing up steadily through the novel’s publication in 1868 and peaking at 1872. Then it plateaus and fluctuates through 1920 before dramatically increasing again with the highest peak at 1935. A quick search told me that the story’s first sound film adaptation starring Katharine Hepburn premiered in 1933. For me the best part of using Ngram was playing detective and digging up the reasons behind the frequency increases. A couple other highlights I clued into: the 1994 Academy Award-nominated film adaptation and a major mention in a popular episode of ‘Friends’ in 1997.

Overall I did find the text analysis praxis valuable because I was able to experiment and explore what the tools are capable of. Probably the most important lesson I learned is that projects don’t always turn out the way you expect them to. In a way this is similar to the mapping praxis but instead of the scope limiting me, it was the tools here that put up those constraints. I also think I got in my own way by having really high expectations going into things, thinking I would have a strong hypothesis up front and the tools would help me prove it.  We discussed that this area of DH can be challenging despite most initially assuming text mining to be immediately beneficial for projects in the field. And after this praxis, my assumptions have definitely changed.

Analyzing Hoxha’s Text

A few years ago I worked on an independent project that included some analysis of Enver Hoxha’s rhetoric during his rule, from WWII until the end of his life in 1985. More specifically, I looked at how Hoxha redefined terms central to his rhetoric, over and over again, to serve his political needs, especially during moments of political upheaval. I read through his body of work looking for the words ‘liberal’ and ‘conservative’ in a collection of his writings more than half a century old.

For this textual analysis assignment I wanted to see how these tools could help with my Hoxha research, as it is also central to what I am hoping will become my capstone project. Luckily there are a few online archives that contain Hoxha’s work. I collected the text from Marxists.org, a volunteer-supported project that contains his complete works and a few selected English translations. The translated work does not cover much of the time period or many of the topics that would be useful to my project.

I began the analysis process by attempting to analyze the Albanian text. The results were what I expected: Voyant Tools produced only an error message, and Mimno could not read Albanian. I also used Google Ngram to see if there was a trend that pointed out a crucial date in my research (1972) and the results were really surprising, as Albania was completely closed off at the time and I did not realize the events of 1972 were felt outside the country.

I decided to focus on Voyant Tools because it was much more accessible, and continued to work with Hoxha’s English text. I played around with the tools available on Voyant, looking for any relationship or word that could indicate a pattern I had not noticed by painstakingly reading through the text. Ultimately I found nothing useful for my research, and I do not anticipate that changing with the rest of the English text. Although my project ideally lends itself to textual analysis, as it looks at terms over a period of time, I don’t expect the relationships or patterns that could emerge from a textual analysis to dramatically affect my research. But I remain open to the idea that I might find more supporting evidence this way. I would love to have the opportunity to use the original Albanian and find out, because the visualizations on Voyant are great and this was fun.

Text Mining Praxis Assignment: Fanon’s “Black Skins, White Masks” in French vs. English

For my text mining assignment, I wanted to see what would happen if I tried separately inputting a book in two different languages. In particular, I wanted to see if Voyant could capture or visualize any translation decisions or “glitches in translation” that may come up when a text is translated from one language to another: Would some words appear more often in one language than the other? Would some words not translate clearly? Can translation decisions be captured and understood clearly through a tool like Voyant?

I chose to input PDFs of Frantz Fanon’s 1952 book Peau Noire, Masques Blancs and its 1986 English translation by Charles Lam Markmann, Black Skins, White Masks. Both versions were downloaded from Monoskop. For both texts, I set the word cloud to show just the top 125 words.

From inputting the two translations into Voyant, the most noticeable result was a metadata issue: Voyant indexed “_black_skins_white_masks” from the English version, causing that “word” to take up a lot of space in the word cloud. Underscores were an issue for a few terms in the English version’s word cloud, as were inclusions of Fanon’s name as a term. These, I presume, are pieces of metadata hidden throughout the PDF in ways that I cannot easily trace through a simple “ctrl+F” search in my Preview application.
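One workaround, outside Voyant, is to tokenize the raw text yourself before uploading, so the underscore debris the PDF leaves behind splits apart and unwanted names can be stoplisted. A rough sketch (the stoplist and sample string are illustrative, not taken from the actual PDF):

```python
import re

def clean_tokens(text, stoplist=frozenset()):
    # Match only runs of letters/apostrophes, so underscore-joined
    # filename debris like "_black_skins_white_masks" falls apart into
    # ordinary words; then drop anything in a custom stoplist.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stoplist]

raw = "_black_skins_white_masks_ Fanon writes about the mask"
print(clean_tokens(raw, stoplist={"fanon"}))
# → ['black', 'skins', 'white', 'masks', 'writes', 'about', 'the', 'mask']
```

The cleaned token list can then be re-joined and pasted into Voyant without the metadata terms dominating the word cloud.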

In regards to the actual terms, there were clear discrepancies in the frequency of term usage between the English and French versions of the text (at least to the eye of someone who doesn’t know French). To name a few examples: “noir” was used 339 times, while “black” was used 357 times; “nègre” was used 399 times while “Negro” was used 436 times; “blanc” was used 289 times while “white” was used 504 times; and “l’homme” was used 94 times while “man” was used 423 times. On the one hand, as someone who does not know French, there may be other French words used in place of, say, “man”, besides just “l’homme”, which may explain the large difference in usage between the two terms.
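These side-by-side counts can also be computed directly in a few lines, which makes the comparison reproducible outside Voyant. A sketch over a toy corpus (the accent-aware regex and the two sample sentences are my own assumptions, not the Fanon texts):

```python
import re
from collections import Counter

def compare(fr_text, en_text, pairs):
    # Count each (French term, English term) pair in its own corpus.
    fr = Counter(re.findall(r"[a-zàâçéèêëîïôûùü']+", fr_text.lower()))
    en = Counter(re.findall(r"[a-z']+", en_text.lower()))
    return {f"{a}/{b}": (fr[a], en[b]) for a, b in pairs}

fr = "l'homme noir et l'homme blanc"
en = "the black man and the white man"
print(compare(fr, en, [("noir", "black"), ("blanc", "white"), ("l'homme", "man")]))
# → {'noir/black': (1, 1), 'blanc/white': (1, 1), "l'homme/man": (2, 2)}
```

Run against the full PDFs, mismatched pair counts like these are exactly the translation discrepancies the word clouds surfaced.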

On the other hand, what is made clear through the word clouds is that specific decisions are made in the act of translation, decisions that show the non-linearity and non-neutrality of the act itself. While this is a relatively obvious claim made time and again, it was interesting to see it happen in front of my own eyes. Additionally, it’s interesting to think about how my understanding of the text may change if or when I learn French and read the text in its original language. This made clear to me why one may prefer certain translations to others, and how specific terms may not only better depict a certain claim, but also perhaps historicize and contextualize these claims in ways that other translations may not be able to communicate.

Overall I found this assignment interesting; it made me think about language as a technology/technique in and of itself: the non-neutrality of language, perhaps, as a way in which specific ways of knowing and understanding are brought to the forefront in ways inextricable from power—which is something that Sylvia Wynter has talked about before.

Word Soup! – Voyant’s Text Analysis Tool

I wanted to test out Voyant’s proficiency with texts that contain multiple languages. To do this, I inserted various texts into the software: English, Spanish, and two texts with a mixture of both. Was Voyant able to 1. distinguish between the two languages and 2. make connections between words and phrases in both English and Spanish?

I first used Red Hot Salsa, a bilingual poetry anthology edited by Lori Marie Carlson. The text is composed of English and Spanish words, lending authenticity to the Latin American experience in the United States. Voyant could not recognize, distinguish, or take note of the differences in word structure or phrases. The tool objectively calculated the number of words used, the frequency with which they were used, and where in the text these words appeared. Another test consisted of a popular bilingual reggaeton song entitled Taki Taki, performed by DJ Snake, Ozuna, Cardi B, and Selena Gomez. The system was again able to capture the number of words and how frequently they appeared. Yet the way it measured connections was through word proximity, and in a song that repeats the same words and phrases, this measurement is not clear.
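Voyant treats every token identically, but even a crude heuristic can begin to tell the two languages apart: tally tokens against a small list of common words for each language. A toy sketch (the marker lists here are tiny and my own invention, nothing like a real language identifier):

```python
# Hypothetical mini stopword sets; a real identifier would use far more.
ES_MARKERS = {"el", "la", "de", "que", "y", "en", "un", "una", "mi", "con"}
EN_MARKERS = {"the", "of", "and", "to", "in", "a", "my", "with", "is", "it"}

def language_mix(text):
    # Count how many tokens match each language's common-word list.
    words = text.lower().split()
    return {
        "spanish_markers": sum(w in ES_MARKERS for w in words),
        "english_markers": sum(w in EN_MARKERS for w in words),
        "total": len(words),
    }

print(language_mix("I walk to la tienda with mi hermana every day"))
# → {'spanish_markers': 2, 'english_markers': 2, 'total': 10}
```

Even this rough tally shows the kind of distinction a bilingual corpus needs and that Voyant does not attempt.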

Finally, I decided on an older English text, one of my favorite poems: “Sweetest Love, I Do Not Go” by John Donne. Here I looked at the Links tool and noticed the connection between the words die, sun, alive, and parted. The tool gave me a visual representation of the metaphors inside the poem (just because we are apart, we won’t die; like the sun, I will come again, alive). I found the Links section the most useful part of Voyant.
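The Links view is essentially a collocation count: for a chosen word, tally what appears within a small window around each of its occurrences. A minimal sketch of that idea (the window size and the sample line are my own choices, not Voyant’s defaults):

```python
import re
from collections import Counter

def collocates(text, target, window=3):
    # Count words appearing within `window` words of each occurrence
    # of `target`, roughly what Voyant's Links view visualizes.
    words = re.findall(r"[a-z']+", text.lower())
    near = Counter()
    for i, w in enumerate(words):
        if w == target:
            lo, hi = max(0, i - window), i + window + 1
            near.update(words[lo:i] + words[i + 1:hi])
    return near

poem = "sweetest love i do not go for weariness of thee"
print(collocates(poem, "love", window=2))
```

Drawing an edge from the target to each of its most frequent collocates yields the node-and-line graph the Links window shows.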

While exploring this tool, I recalled Cameron Blevins’s experience with text mining and topic modeling (“Digital History’s Perpetual Future Tense”). As with most of these digital apparatuses, one must go in with a clear intention and a grasp of the text’s background before the analysis begins. Without this, the quantitative measures will be there, but they will not have much meaning. They will become just word soup!

“…using computation, and reframing the scale of literary inquiry, are two distinct things.”

Given all of the possibilities digital text creates for address (Witmore), I really appreciated the pieces from this week that trouble text analysis by questioning the over-reliance on computational methods to conduct it. 

By re-centering the physical book and what it offers to analysis, Witmore, along with the essence of the Underwood quotation in this post’s title, inspired me to think about the conversation both readings provoke about the limitations of text analysis (as it stands) and why it matters to society at large.

When I think about the salience of distant reading and lemmatization (or categorizing text) as a means of digital text analysis, I am reminded not only of the analytical flaws probable in research that relies on it, but also that the aggregated processing of information feels predetermined in our data-driven world. It is not enough to point out the cultural nuances lost in analysis; we must also recognize that surveillance- and capitalist-driven attitudes toward a rapid, digitized world were widely accepted as an objective route to information and readership alike.

I can recall very few public discourses that speak to the value of the manuscript, aside from the one that emerged when Amazon’s Kindle became popular in the late 2000s. While much of that conversation concerned profit margins, modernity, and access (versus institutional research and DH), a handful of arguments advocated for the physicality of books and the analytical processes unique to them, processes that computational analysis misses.

In advocating for better addressability in text analysis, one that yields the qualities associated with analyzing the manuscript, Witmore insists that a phenomenological shift would have to occur:

“We need a phenomenology of these acts, one that would allow us to link quantitative work on a culture’s “built environment” of words to the kinesthetic and imaginative dimensions of life at a given moment.”

Much as Drucker (2014) argues, there is a cultural address that should precede our methods and become valued by DHers and non-DHers alike, one that concerns the way we analyze textual information as a society — past and present — and that begins with our foundational relationship to text.