Extracting topics from tweets is turning out to be quite the adventure, so it feels justified to write a couple of posts on what has become an exploration of many techniques for what at first seemed like an easy task.
The discursus.io Project
So let’s start by giving a little bit of context to the task at hand.
The goal of the discursus.io project is to monitor protest movements around the world in real time. For now I have 2 main datasets to do this:
- GDELT – A massive database of all events scraped and updated every 15 minutes from newspapers around the world.
- Twitter – Real-time scraping of all tweets that contain the #protest hashtag
All data are currently being pushed to 2 databases: Elasticsearch for social media data and MySQL for GDELT events.
Our immediate goal is to join those 2 datasets. To do this, I would like to join them on the protest movement each record relates to: GDELT’s articles cover specific protest movements, and tweets are reactions to specific movements.
So for example, this is a map of all protest events that happened on March 14, 2017. That big blue dot in the middle of the map represents protests in Turkey following the Netherlands’ denial of entry to Turkish ministers.
Our goal would be to associate each protest event with a distinct protest movement (for example, quite a few of the protest events mapped in the US are related to Trump’s Muslim ban), and then to overlay all tweets on top of the related protest movements.
The Intuitive Approach
My initial assumption was that it would be easy to extract clear protest movements from tweets’ hashtags. That assumption was based on past observations that protest movements usually “brand” themselves with a unique hashtag to help amplify the movement. Think of the Occupy protests, or the protest movement in Brazil in 2013.
Based on that assumption, I created a script to extract the most popular hashtags for a ~30 day period. So, for a month’s worth of tweets, I have the following graphs…
At first glance this is not too bad. But of course, you still need a human to make sense of it, because the links between hashtags are not obvious when you only single out the most popular ones. For example, you need to know about the Dhammakaya Movement to realize that the hashtags #dhammakaya and #thailand are most probably associated with the same protest movement.
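To make the counting step concrete, here is a minimal sketch of how the most popular hashtags could be tallied. This is not the actual script behind the graphs above; the function names, the regular expression, and the sample tweets are all hypothetical, and it assumes the tweet texts are already loaded into a plain list of strings.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags of one tweet, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def top_hashtags(tweets, n=10, exclude=("protest",)):
    """Count how many tweets mention each hashtag and return the top n.

    'protest' is excluded by default because every collected tweet
    contains it, so it carries no information about the movement.
    """
    counts = Counter()
    for text in tweets:
        # set() so a hashtag repeated inside one tweet is counted once.
        for tag in set(extract_hashtags(text)):
            if tag not in exclude:
                counts[tag] += 1
    return counts.most_common(n)


# Hypothetical sample data, for illustration only.
sample_tweets = [
    "Crowds gather in Bangkok #protest #dhammakaya #thailand",
    "Temple standoff continues #Protest #Dhammakaya",
    "March downtown today #protest #resist #humanrights",
    "#protest #resist #resist",
]

if __name__ == "__main__":
    for tag, count in top_hashtags(sample_tweets):
        print(f"#{tag}: {count}")
```

Counting tweets rather than raw occurrences keeps a single tweet that repeats a hashtag from inflating its rank.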
That leads to combining hashtags to see whether hashtag combos would give us a clearer delimitation of protest movements. When combining hashtags, we get the following most popular combos…
It does help to give a bit more context to each hashtag, but honestly it doesn’t provide any clear-cut topic extraction. Worse, there are a lot of overlaps and really generic hashtags, such as #humanrights and #resist.
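The combo counting can be sketched in a few lines as well. This is only an illustration under assumptions: it treats a "combo" as an unordered pair of hashtags appearing in the same tweet (the combos above may have been built differently), it expects hashtags that are already extracted per tweet, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_hashtag_pairs(tweet_hashtags, n=10):
    """Count unordered hashtag pairs that co-occur in the same tweet.

    tweet_hashtags is a list where each item is the list of hashtags
    found in one tweet.
    """
    counts = Counter()
    for tags in tweet_hashtags:
        # Lowercase, deduplicate, and sort so that (a, b) and (b, a)
        # always end up as the same key.
        unique = sorted({tag.lower() for tag in tags})
        counts.update(combinations(unique, 2))
    return counts.most_common(n)


# Hypothetical sample data, for illustration only.
sample_tweet_hashtags = [
    ["dhammakaya", "thailand"],
    ["Thailand", "dhammakaya", "humanrights"],
    ["resist", "humanrights"],
    ["resist"],
]

if __name__ == "__main__":
    for (a, b), count in top_hashtag_pairs(sample_tweet_hashtags):
        print(f"#{a} + #{b}: {count}")
```

Even in this toy sample, a generic hashtag like #humanrights pairs up with unrelated hashtags, which is the overlap problem described above.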
Gotta Keep Experimenting
Obviously, that intuitive approach (or I should probably call it the naive approach) is not worth much. None of the most popular hashtags clearly links to a specific movement, and the hashtag combinations do not give a much clearer indication of those movements either.
So we have to keep on trucking.
How about network analysis then? Each hashtag is a kind of entity that has relationships with the other hashtags/entities that appeared in the same tweet. Maybe that will give us a clearer demarcation of what the protest movements are?
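To illustrate the idea, here is a rough sketch of one possible way to treat hashtags as a network: build weighted co-occurrence edges, drop the weak ones, and read each connected component as a candidate movement. Using connected components and a `min_weight` threshold is my assumption for the sake of the example, not necessarily the method I will end up with, and the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_edges(tweet_hashtags):
    """Return a Counter mapping (hashtag_a, hashtag_b) to the number of
    tweets in which both hashtags appear together."""
    edges = Counter()
    for tags in tweet_hashtags:
        unique = sorted({tag.lower() for tag in tags})
        edges.update(combinations(unique, 2))
    return edges


def hashtag_clusters(edges, min_weight=1):
    """Group hashtags into connected components, keeping only edges
    seen in at least min_weight tweets."""
    neighbours = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited hashtag.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        clusters.append(sorted(component))
    return clusters


# Hypothetical sample data, for illustration only.
sample_tweet_hashtags = [
    ["dhammakaya", "thailand"],
    ["thailand", "dhammakaya", "bangkok"],
    ["muslimban", "resist"],
    ["resist", "muslimban"],
    ["resist", "thailand"],  # a single weak link between two movements
]

if __name__ == "__main__":
    edges = build_cooccurrence_edges(sample_tweet_hashtags)
    for cluster in hashtag_clusters(edges, min_weight=2):
        print(cluster)
```

With `min_weight=1` the single weak link merges everything into one cluster, so the threshold matters a lot in this kind of approach.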
Well, I’m almost done experimenting with that approach and I’ll share the results in the next post.