In our last post, we started exploring the challenge of extracting topics from tweets, and one key takeaway was that there is no easy way of doing it. But they say that learning how not to do something is still learning something. As Edison said: “I have not failed. I’ve just found 10,000 ways that won’t work.”
As you might remember from that post, our goal was (and still is) to find the main topics in a month’s worth of tweets that contain the hashtag #protest. We want to keep track of protest movements, and topic extraction would allow us to assign tweets to their appropriate movement.
Our initial idea was to use the most frequent hashtags, as we thought that movements would be branding themselves with a unique hashtag that would just pop out. But they didn’t. The only hashtag that pops out is #trump and I’m still trying to figure out to whom or what that refers 😉
So alright, we need to get a bit more clever. Time to dust off that clever hat and try something different. How about network analysis, you ask? Yeah, how about network analysis.
Each hashtag is some sort of entity, and two hashtags have a relationship whenever they are found within the same tweet. And after doing some very thorough research on network analysis (aka reading the Wikipedia page on social network analysis), it seems that having nodes (entities) and edges (relationships) is all that’s really needed.
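To make the nodes-and-edges idea concrete, here is a minimal sketch of how hashtag pairs could be counted. It is written in Python with only the standard library, not the R used for the actual analysis, and it is an illustration rather than the code behind this post. The function name `hashtag_edges`, the sample tweets, and the choice to ignore #protest (every tweet has it, so it would link everything to everything) are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")

def hashtag_edges(tweets, ignore=("protest",)):
    """Count how often each pair of hashtags appears in the same tweet."""
    edges = Counter()
    for text in tweets:
        # Nodes: the distinct hashtags of this tweet, lowercased.
        tags = {t.lower() for t in HASHTAG.findall(text)} - set(ignore)
        # Edges: every unordered pair of hashtags sharing the tweet.
        for pair in combinations(sorted(tags), 2):
            edges[pair] += 1
    return edges

# Made-up tweets, for illustration only.
sample_tweets = [
    "March downtown today #protest #climate #strike",
    "Join us! #protest #strike #climate",
    "#protest #trump rally tonight",
]
edges = hashtag_edges(sample_tweets)
```

Each key of `edges` is a pair of hashtags and each value is the number of tweets in which the two appear together, which is roughly what a graph library needs as a weighted edge list.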
On we go with network analysis then!
It should be mentioned that I’m using R for most of my analysis and would be happy to share my code if someone’s interested. In fact, maybe I’ll just do a big dump of all the code that relates to topic extraction from tweets in a public GitHub repository at the end of this series.
After magical incantations and a little bit of arm-twisting, I get the following from our network analysis of the tweets’ hashtags. Not perfect, but certainly more useful than the glut of info we had previously.
Notice that association at the bottom left between “engagelk” and “math”? Well, that one makes me laugh. No clue how they got associated with #protest tweets 🙂
The previous graphic is based on an object that can be transformed into a table of all the associations that form each cluster.
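As a rough illustration of what such a table could look like, the Python sketch below groups hashtag pairs into clusters and emits one row per association. The post does not say which clustering method produced the graphic, so plain connected components are used here as a stand-in assumption. The function name `cluster_table` and the sample pairs are hypothetical too (only “engagelk” and “math” come from the graphic above).

```python
from collections import defaultdict

def cluster_table(edges):
    """Return sorted rows of (cluster_id, tag_a, tag_b), one per association."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cluster_of = {}
    next_id = 0
    for start in sorted(neighbours):
        if start in cluster_of:
            continue
        next_id += 1
        # Depth-first walk: everything reachable from `start` shares its cluster.
        stack = [start]
        while stack:
            node = stack.pop()
            if node in cluster_of:
                continue
            cluster_of[node] = next_id
            stack.extend(n for n in neighbours[node] if n not in cluster_of)

    return sorted((cluster_of[a], a, b) for a, b in edges)

# Hypothetical associations, for illustration only.
sample_edges = [("climate", "strike"), ("strike", "union"), ("engagelk", "math")]
rows = cluster_table(sample_edges)
for row in rows:
    print(row)
```

With these sample pairs, “climate”, “strike” and “union” end up in one cluster and “engagelk” and “math” in another. A real analysis would likely use a proper community detection method, since connected components lump together anything joined by even a single stray tweet.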
That information can then be used to assign each tweet to the topics (groups or clusters) that contain the hashtag associations found in that tweet.
Ahhh, I think we’re getting there. But I still think we can do better. In fact, I’m really positive about it, since I’m close to having the results of our next attempt at solving the topic extraction challenge in front of me (I know I know, kinda cheating there).
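Here is a small Python sketch of that matching step, again an illustration under assumptions and not the R code behind this post. It assumes the table is a list of `(cluster_id, tag_a, tag_b)` rows and treats a tweet as belonging to a cluster as soon as it contains both hashtags of any one association in that cluster. The name `assign_topics`, the table contents, and the tweets are made up.

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def assign_topics(tweet, table):
    """Return ids of clusters with at least one association fully present in the tweet."""
    tags = {t.lower() for t in HASHTAG.findall(tweet)}
    return {cid for cid, a, b in table if a in tags and b in tags}

# Hypothetical association table: (cluster_id, tag_a, tag_b).
table = [
    (1, "climate", "strike"),
    (1, "strike", "union"),
    (2, "engagelk", "math"),
]

matched = assign_topics("Walkout at noon #protest #Strike #union", table)
unmatched = assign_topics("#protest #trump", table)
```

In this example `matched` contains cluster 1, while `unmatched` is empty because the tweet carries no hashtag pair from the table. Tweets with a single hashtag besides #protest can never match under this rule, which is one reason the approach leaves room for improvement.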
Next post will be about text mining. Excited yet?