word2vec results with PCA
April 3, 2016
The code for this project can be found on my Github, although it's still a work-in-progress.
Please note that this writeup is still a draft. I'll continue to update it in the coming week with more details and exposition.
In 2013, Google published the word2vec algorithm, which analyzes a large body of text and embeds each word into a vector space such that linguistic and semantic relationships between words are generally preserved.
The above image is an example of such semantic preservation. On the left panel, we see that the vector from "man" to "woman" is approximately the same as the vector from "king" to "queen". In vector arithmetic, we can express this as:
king - man + woman ~= queen. Similarly, on the right panel, we can see that taking a singular noun to a plural noun is equivalent to adding the same vector to both "king" and "queen", i.e.,
kings = (queens - queen) + king.
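The analogy arithmetic above can be sketched in a few lines of numpy. The vectors here are toy, hand-chosen three-dimensional embeddings (real word2vec vectors are learned and typically have hundreds of dimensions), constructed so that the man-to-woman offset matches the king-to-queen offset:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings: the second coordinate plays the role of a "gender"
# direction, the third a "royalty" direction. Purely illustrative.
vectors = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
    "apple": np.array([0.0, 0.2, 0.1]),  # a distractor word
}

# king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(
    (w for w in vectors if w not in ("king", "man", "woman")),
    key=lambda w: cosine(target, vectors[w]),
)
print(best)  # queen
```

With real word2vec embeddings, the same query is what libraries expose as "most similar" lookups with positive and negative word lists.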
However, prior to March 2016, there was no easy way for people to use word2vec to explore the structure of large datasets of text. Although word2vec has implementations in C and Python, there was no layman-accessible interface for running the algorithm and visualizing the results.
To remedy this, I created a web service that takes the results of word2vec and lets the end user visualize them with respect to the principal components of the word embeddings. Since the words are embedded in a high-dimensional vector space, we must perform dimensionality reduction before we can generate human-accessible graphics, and principal component analysis is one way of doing so.
Principal component analysis extracts the principal components of a dataset by finding the directions that successively explain as much variance as possible, under the constraint that the directions remain orthogonal to one another. That is to say, it is a rotation and relabeling of the coordinate axes so that the first principal component accounts for as much variation in the data as possible, the second principal component accounts for as much of the remaining variation as possible, and so on.
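A minimal sketch of this procedure via the singular value decomposition, using randomly generated stand-ins for word embeddings (rows are "words", columns are embedding dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 10))          # 100 "words", 10 dims

centered = embeddings - embeddings.mean(axis=0)  # PCA requires centering
# The rows of Vt are the principal components, ordered so that each
# successive (orthogonal) direction explains the most remaining variance.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T                  # project onto first 2 PCs

print(projected.shape)                     # (100, 2)
print(np.allclose(Vt @ Vt.T, np.eye(10)))  # True: components are orthonormal
```

The two-dimensional `projected` coordinates are what ultimately get drawn; with real embeddings, each row would be plotted and labeled with its word.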
Combining these two powerful techniques, word2vec and PCA, we can reach a number of compelling and stimulating results in a manner easily reproducible (via my web interface) by someone with no knowledge of programming whatsoever.
We can create word cloud graphics by calculating a couple of the principal components of the word2vec results and then arranging the words at the two extreme ends of a principal component into an image.
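Selecting the words for each end of the cloud amounts to ranking every word by its score along a principal component and taking the extremes. A sketch, with made-up words and random embeddings standing in for real data:

```python
import numpy as np

# Hypothetical vocabulary; embeddings are random stand-ins for
# real word2vec output.
words = ["lua", "backends", "webserver", "hardship", "suppressed", "protesting"]
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(len(words), 5))

centered = embeddings - embeddings.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[0]            # each word's score along the first PC

order = np.argsort(scores)
negative_end = [words[i] for i in order[:2]]   # most negative scores
positive_end = [words[i] for i in order[-2:]]  # most positive scores
print(negative_end, positive_end)
```

In the actual graphics, each end's words would then be sized by the magnitude of their scores and laid out as a cloud.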
We can generate word clouds for the comments on Hacker News, a Reddit-style web forum for software engineers and technology enthusiasts where users can submit links and comment on other users' submissions.
We can interpret the first principal component, depicted above, as having concrete technologies at one end and social issues at the other. That is, the main axis of variation in the semantic content of the discussion on Hacker News is roughly technology vs. oppression.
This isn't surprising: on a technology forum, one certainly expects people to be discussing technologies relevant to software engineering ("lua", "backends", "webserver"), and since software engineers are predominantly upper-middle-class and the central hub of software engineering in the United States is Silicon Valley (located in a state noted for its political liberalism), it's also unsurprising that Hacker News users have an ardent interest in remedying social ills ("hardship", "suppressed", "protesting", "marginalized"). And, of course, these two topics are disparate and unlikely to occur together, which is precisely why they form opposite ends of this principal component.
The second principal component of the word vector space is illustrated above and is a little more difficult to interpret. At the top, we clearly have a lot of locations ("victoria", "sierra", "vancouver", "portland"); at the bottom, we have a variety of abstract and vaguely mathematical words ("orthogonal", "illogical", "inherently").
My speculation is that this second principal component reflects an abstract vs. concrete divide in the content of Hacker News comments. Why do we not see words about, say, computers and hardware on the concrete end of the component instead of just locations? Well, recall that the second principal component explains the variation that isn't already explained by the first, so since we've already accounted for specific technologies and social issues, the concrete aspects of Hacker News comments are dominated by locations.
We can also generate word clouds for the comments on LessWrong, a forum for people interested in rationality as well as a number of other tangentially related topics like effective altruism and AI risk.
The words at the positive end of the first principal component are the sort of words that you'd use to criticize someone's reasoning, like "uncharitably", "unjustified", and "misrepresenting". (Interestingly, note the presence of "elizier".) On the other end of the dimension, we see concrete, scientific, almost industrial words—"trucks", "panels", "gas", and so on and so forth.
I suspect that the main divide here is abstract vs. concrete (like the second principal component of the Hacker News comments) and the specific topics reflected here just indicate the topical choices of LessWrong commenters. It's interesting that, unlike with PC1 of the Hacker News comments, there isn't a topical divide so stark that it becomes more important than the fundamental divide between abstraction and specifics.
The most obvious interpretation of the second principal component, upon seeing this graphic, is "evil vs. math". It's not immediately clear to me what to make of this, but it is interesting. Other ways to interpret this include verbs vs. nouns (the evil words mostly describe evil actions like "raping", whereas the mathematical terms describe specific notions or techniques like "decomposition" or "manifolds") and emotional vs. unemotional.
We have been able to produce a number of interesting visualizations of the results of word2vec on datasets of text.
There is still ample space for further work in this direction, e.g. with paragraph vectors or document2vec, which are extensions of word2vec. There are also many other ways to visualize the results of these computations, and a more interactive, versatile user interface for producing a larger variety of graphics would expand our ability to understand the semantic variation within text corpora.