From There to Here
We didn’t set out to study fan fiction. Rebecca’s original idea was to investigate how Rowling is described on social media, which prompted Tolonda to suggest that we also look at how Harry is described in the books. We wanted to see if the words used in tweets about Rowling commonly appeared near the names of characters, thinking first of Harry and then shifting to Dumbledore and Rita Skeeter. We went through several iterations of that project, even going so far as to subscribe to a sentiment analysis service, before we realized that we actually had two parallel projects that didn’t necessarily relate to each other. At this point, Rebecca decided to leave social media for another day and wondered if we could compare the way Rowling talks about a character to the way fans talk about that character by analyzing fan fiction. This would allow us to use the same method of investigation – text mining – across Rowling-generated and fan-generated text. The question of how fans interact with Rowling remained consistent even as we shifted the specifics of our inquiry.
Methodology
So, how did we go about finding out what words appeared near other words? This was tricky! Our first thought was to generate a word cloud, but we soon realized that the word clouds you can create on the web are based on word frequency, which isn’t what we wanted to know. When we googled “proximity based word cloud,” we learned about an algorithm called Word2Vec. Working in the programming language Python, a researcher can use Word2Vec to get a sense of which words are close to a target word in a text. The first step is to establish the text or collection of texts, known as a corpus, that you want to study. In our case, we worked with sixteen corpora: each book in the Harry Potter series and then a collection of popular Harry Potter fan fiction for each year from 2009 to 2016 (see here for more information on how the fan fiction corpora were created).
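For readers who want a concrete picture of this first step, here is a minimal sketch of how a corpus might be loaded and prepared in Python. The file names and the simple sentence-splitting are placeholders for illustration, not a record of how our sixteen corpora were actually assembled.

```python
# A minimal sketch of the corpus-building step. The file paths are
# placeholders; the actual corpora (the seven novels plus the yearly
# fan fiction collections) were assembled as described above.
import re

def load_corpus(path):
    """Read a plain-text file and return a list of sentences, each a list
    of lowercase word tokens -- the input format Word2Vec expects."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sentences = re.split(r"[.!?]+", text)
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s.strip()]

corpora = {
    "book1": load_corpus("texts/philosophers_stone.txt"),  # placeholder path
    "fanfic_2009": load_corpus("texts/fanfic_2009.txt"),   # placeholder path
    # ...one entry for each of the sixteen corpora
}
```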
Next, Tianyu eliminated the stop words from each corpus. Stop words are words like “a,” “of,” and “the” that appear frequently in written English but don’t carry much meaning on their own. Now she was ready to train the Word2Vec models – one for each corpus – from which we would extract the data that would inform our analysis. Training the models established a vector for each word remaining in the corpus and, taken together, these vectors exist within a three-hundred-dimensional vector space. According to Jayesh Bapu Ahire, “a word vector is a row of real valued numbers...where each point captures a dimension of the word’s meaning and where semantically similar words have similar vectors. This means that words such as wheel and engine should have similar word vectors to the word car (because of the similarity of their meanings), whereas the word banana should be quite distant.” We don’t really understand the ins and outs of this, but if you are interested in the technical details, check out this website. Because a vector has a direction, the relationship between two word vectors can be calculated based on the angle the two word vectors make with each other.
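To make the training step concrete, here is a rough sketch using the gensim library’s Word2Vec implementation and NLTK’s built-in stop word list, continuing from the corpus-loading sketch above. The parameter values are illustrative guesses, not the settings Tianyu actually used.

```python
# A rough sketch of stop word removal and model training, one model per
# corpus. Parameter values are illustrative, not the project's settings.
from gensim.models import Word2Vec
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

stop_words = set(stopwords.words("english"))  # "a", "of", "the", and so on

models = {}
for name, sentences in corpora.items():
    # Drop stop words from every sentence before training.
    filtered = [[w for w in sent if w not in stop_words] for sent in sentences]
    # Train a model whose word vectors live in a 300-dimensional space.
    models[name] = Word2Vec(filtered, vector_size=300, window=5, min_count=5)
```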
Let me pause here and explain that two words being close to each other in the vector space is not the same as those two words being close to each other in the corpus itself. When we examined the list of neighbor words for the target word “Ron” in Goblet of Fire, we were surprised to find the word “Frank.” After searching the electronic version of that book, we confirmed our suspicion that the only “Frank” was in fact the muggle Frank Bryce from the first chapter. The trouble was, Ron is not in that chapter. As Tianyu explained to us, the high number of dimensions in the vector space means that words are considered neighbor words when one is close to a word that is close to a word that is close to a word (and so on, up to 300 times). So, if the text tells us that “Frank climbed the stairs” and that “Frank unlocked the door” while also telling us that “Ron climbed the stairs” and “Ron unlocked the door,” Ron and Frank would be neighbor words in the vector space because they both appear close to the words “climbed,” “stairs,” “unlocked,” and “door.” The word “Frank” shows up on Ron’s list of neighbor words not because they are neighbors in the text but because they are neighbors in the vector space. The closeness refers to the meaning of the word, not to its location in the corpus.
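To see why shared contexts translate into closeness, here is a toy illustration that is separate from our actual pipeline: it builds tiny count vectors by hand for the made-up sentences above and computes the cosine of the angle between them, which is the same kind of similarity measure the Word2Vec models report.

```python
# A toy illustration, not part of the Word2Vec pipeline: each word gets a
# hand-built vector counting how often it appears near a few context words,
# and similarity is the cosine of the angle between those vectors.
import numpy as np

context_words = ["climbed", "stairs", "unlocked", "door", "peeled"]

# Counts taken from the made-up sentences in the paragraph above,
# plus an unrelated word for contrast.
frank  = np.array([1, 1, 1, 1, 0])
ron    = np.array([1, 1, 1, 1, 0])
banana = np.array([0, 0, 0, 0, 1])

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(frank, ron))     # 1.0 -- identical contexts, maximally close
print(cosine_similarity(frank, banana))  # 0.0 -- no shared contexts, not close at all
```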
How did we get from closeness to similarity? By way of the distributional hypothesis.
The distributional hypothesis is a set of assumptions at the core of how computational linguistics thinks about texts. These assumptions are:
a) a word’s meaning is tied to how often it occurs;
b) a word’s meaning is tied to how often it occurs with other words in a given context;
c) these relationships are entirely contingent upon the scale of analysis;
d) and these relationships can be rendered spatially to capture semantic associations between them (Piper 13)
So in this context, two similar words are not synonyms for each other; they are words that appear close to each other in the vector space. The most similar word to Harry in the first book is Hagrid, but that does not mean that Harry is a half-giant with a penchant for monsters. Instead, it means that “Harry” and “Hagrid” appear around similar words and are used in similar ways.
When we queried each model based on the target word we wanted to study, the output was a spreadsheet with three columns: rank (between 0 and 199), neighbor word (almost any word in the corpus), and cosine similarity (a number between 0 and 1 that indicates the level of similarity). We decided to look at just the names that show up on this list of 200 neighbor words. So we can see that for Harry in the first book, the first name is Hagrid. But if the target word is Hermione in the first book, the closest name is Harry, and the same is true if the target word is Ron. So the data gives us a sense of the relationships between these characters, which is what we explored in our analysis.
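For the technically curious, the querying step might look something like the sketch below, again using gensim and the models from the earlier sketch. The code is a reconstruction for illustration; only the output format (rank, neighbor word, cosine similarity) reflects the spreadsheets we actually worked with.

```python
# A sketch of querying one model for the 200 nearest neighbor words of a
# target and writing them to a spreadsheet-style CSV file. Reconstructed
# for illustration; not the project's original script.
import csv

model = models["book1"]   # the model trained on the first book
target = "harry"

# most_similar returns (word, cosine similarity) pairs, best match first.
neighbors = model.wv.most_similar(target, topn=200)

with open(f"{target}_book1_neighbors.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "neighbor word", "cosine similarity"])
    for rank, (word, similarity) in enumerate(neighbors):  # ranks 0 through 199
        writer.writerow([rank, word, round(similarity, 4)])
```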