
How to Use This Website

This site was designed with two often-overlapping audiences in mind: 1) scholars of digital humanities and/or fan studies; and 2) anyone who wants to play with this data, whether scholars looking for experience in analyzing data or Harry Potter fans wanting to test their own private theories about characters. This tutorial was written for both audiences, so that everyone can understand the data and feel comfortable exploring the site and creating and analyzing visualizations. Whether or not you use text mining or study fan fiction, this site offers a safe space to play with a specific type of analysis with no consequences. So, have fun!

Before we get started with the tutorial itself (which will walk you through each page of the site), let's cover the basics:

What can I do on this site?

The main thing you can do on this site is walk through the steps of focusing, interacting with, and analyzing a visualization built from data that we've already text mined. You may well end up exploring parts of the data that we have not analyzed ourselves, and everything on this site is offered for your own use.

So what data would I be using?

We ran an algorithm called Word2Vec on the 7 Harry Potter novels and 450 pieces of fan fiction from Archive of Our Own (for information on how we selected these 450 pieces as well as the information for each, see the Sources page). Specifically, we told the algorithm to look for 7 character names, which are the names that you can work with on this site: Harry, Ron, Hermione, Draco, Sirius, Voldemort, and Dumbledore.

What the hell is a Word2Vec?

Word2Vec is an algorithm that measures similarity between words by looking at the context in which each word appears and how it is used. As we ran it, Word2Vec looks at the three words before and after each word (minus words like "a" or "the") in a chosen corpus, such as a book series or a Wikipedia page. It then turns this information into vectors that effectively capture what a word means within that specific corpus. When we tell the algorithm to look at a target word (say, Harry), it tells us which words are most similar to that word in context and use within the corpus, rating every word on a scale of 0-1, with 1 being most similar and 0 being least. The idea behind Word2Vec is that it lets you pull together words that are used similarly. For example, Word2Vec might show that "man" is used similarly to both "woman" and "boy" but that "woman" and "boy" are less similar to each other. Word2Vec does not find exact synonyms (after all, man and woman mean different things), but it finds the most similar words within the corpus, often revealing connections that we might not otherwise see. Word2Vec does not tell us exactly what these words mean but rather highlights that they have a similar meaning in this context. For more information on the nuts and bolts of Word2Vec, you can check out this site or this site.
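
If you'd like to see the "three words before and after" idea in action, here is a small Python sketch. To be clear, this is not Word2Vec itself and not the code we ran: the real algorithm learns its vectors by training a small neural network, while this toy version simply counts neighbouring words and compares those counts with cosine similarity. The stop-word list, the function names, and the example sentence are all made up for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical stop-word list; the words actually dropped for this site may differ.
STOP_WORDS = {"a", "an", "the", "and", "of", "to"}
WINDOW = 3  # three words before and three words after each word


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def context_vectors(tokens, window=WINDOW):
    """Map each word to a count of the words seen within `window` positions of it."""
    vectors = {}
    for i, word in enumerate(tokens):
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity of two count vectors (between 0 and 1 for counts)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(target, vectors, top_n=5):
    """Rank every other word by how closely its contexts match the target's."""
    scores = [(w, cosine(vectors[target], v)) for w, v in vectors.items() if w != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up toy text, not taken from the novels or the fan fiction.
text = "The wizard waved a wand and the witch waved a wand and the owl carried a letter."
vectors = context_vectors(tokenize(text))

wizard_witch = cosine(vectors["wizard"], vectors["witch"])
wizard_letter = cosine(vectors["wizard"], vectors["letter"])
print(f"wizard vs witch:  {wizard_witch:.2f}")
print(f"wizard vs letter: {wizard_letter:.2f}")
print(most_similar("wizard", vectors, top_n=3))
```

In this toy text, "wizard" and "witch" both appear next to "waved" and "wand", so they score as more similar to each other than "wizard" and "letter" do. With only one sentence the rankings are noisy, which is part of why a large corpus matters.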

What kind of analysis does Word2Vec allow? In other words, what can I possibly find?

Words mean different things to different people at different times in different places, and Word2Vec allows us to measure which words have similar meanings in different corpora. For example, the word "gay" has had drastically different meanings over time, and Word2Vec can help us see that by showing us what words are most similar to it at different points in time. Our original project was to see if a name, such as Harry, has the same meaning in Rowling's books that it has in fan fiction. After all, if we can see what words are most similar to "Harry", we can start to understand how that word is used and understood in a wider context. In other words, we hope that Word2Vec can show us what a name means in the context of a corpus and thus who Rowling or fans think that character is.

How is this different from just reading the fan fiction?

Most literary analysis today relies on what's called close reading, where you carefully read a piece and try to determine what the text (or author) was trying to do. This strategy is really useful, but, because there's so much fan fiction in the world, it's impossible to read all of it. On Archive of Our Own alone, there are nearly six million pieces of fan fiction! What this site uses is called distant reading, where we try to quantify elements of fiction in order to look at trends across lots of fiction rather than just one or two pieces. Both kinds of reading have their uses, but what we can see here are the underlying patterns in how words are used, patterns that might not otherwise be visible to the human eye while reading.

Why am I creating a visual? Why not just give me the numbers?

Actually, if you just want the raw numbers, we have those, too! Go check out the Sources page, and you can find all of the data we've collected (even the data we didn't end up including on the website) under the Data section. However, raw data can be overwhelming and/or unclear. Instead, this site allows you to choose and interact with several different kinds of visualizations (which, here, is just a fancy word for graphs). Digital humanities scholars often use visualizations both to understand their own data and to make sense of it for others. Different kinds of graphs will show you different kinds of information. We hope to make that clear by helping you work through the different visualizations we've provided on this website. If you ever feel like there's another kind of visualization that we should offer but currently don't, let us know by emailing us here!

With the basics of what this site holds in mind, let's get started with that tutorial! The next four pages contain exact replicas of the pages you can find on this site with notes explaining what your different choices mean and why you may want to choose one over another, depending on what you're trying to find. As you go through, you will see an example of us building one of our own visualizations, followed by a quick analysis of it to show you what is possible on this site. The rest is up to you!
