It’s been a while since I’ve shared the results of the Topic Extraction experiment, but you know: Summer.

Excuses aside, in our last two posts (the intuitive approach and the network analysis approach), we explored two different ways of extracting topics from a set of tweets (#protest tweets from February 8th to March 9th, 2017). Now we finally get to what we should have started with all along: text mining! (Dramatic effect here.)

Text mining is a broad term and encompasses a lot of things (sentiment analysis, anyone?). What we're interested in for this experiment is to first group all the tweets together and then use LDA (latent Dirichlet allocation), a statistical topic model, to extract topics and their most relevant terms.
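
To make that concrete, here is a minimal sketch in R (the language we worked in), assuming the tweet text lives in a character vector named `tweets` (a hypothetical name) and using the tm and topicmodels packages. The number of topics, k, is a modelling choice, not something fixed by the method.

```r
library(tm)
library(topicmodels)

# Build a corpus and a document-term matrix from the raw tweet text
corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)

# Drop tweets that lost all their terms during cleaning
dtm <- dtm[slam::row_sums(dtm) > 0, ]

# Fit an LDA model with an (illustrative) choice of 10 topics
lda_model <- LDA(dtm, k = 10, control = list(seed = 1234))

# Top 10 terms for each extracted topic
terms(lda_model, 10)
```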

We’re in the process of building an interactive app so viewers can actually see how this works. For now, here is a screenshot: the extracted topics appear on the left, and selecting a topic displays its associated relevant terms on the right. There is much to be said about that visualization, but we’ll get into those details once we have a live demo going.

Topic Extraction - LDA Visualization

Credit goes to Carson Sievert, who made that interactive LDA visualization package (LDAvis) available to the R community.
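
For anyone who wants to reproduce this kind of view, here is a hedged sketch of how a fitted topicmodels LDA object could be passed to LDAvis, reusing the `lda_model` and `dtm` objects from the earlier snippet; the argument mapping below is our own wiring under those assumptions, not something prescribed by this post.

```r
library(LDAvis)

post <- topicmodels::posterior(lda_model)

json <- createJSON(
  phi = post$terms,                      # topic-term distributions
  theta = post$topics,                   # document-topic distributions
  doc.length = slam::row_sums(dtm),      # tokens per tweet
  vocab = colnames(dtm),                 # vocabulary, in the same order as phi's columns
  term.frequency = slam::col_sums(dtm)   # corpus-wide term counts
)

# Opens the interactive topics/terms view in the browser
serVis(json)
```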

Our goal is not to get into the technical aspects of this solution, as there are plenty of accessible, well-made resources on text mining available. But we wanted to introduce it as we move closer to our final objective: providing a real-time classification and topic extraction solution for daily tweets from a cohort of users (business customers, a HigherEd institution’s students, etc.).

In an upcoming post on topic extraction, we’ll integrate a first layer of analysis before doing any topic extraction. Using a new dataset of student tweets, we’ll first classify each tweet (is it about academics, financial matters, community life, professional work, personal matters, etc.?) and then apply the topic extraction model within each of those categories. We’re curious to see whether our extracted topics improve with this pre-processing.
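
To give a flavour of that two-step pipeline (a sketch only, since this is future work), assume a data frame `student_tweets` with a `text` column and some classifier `classify_tweet()` that returns a category label; both names are hypothetical.

```r
library(tm)
library(topicmodels)

# Hypothetical: label each tweet with a category from the (not-yet-built) classifier
student_tweets$category <- vapply(student_tweets$text, classify_tweet, character(1))

# Then fit a separate LDA model within each category
models_by_category <- lapply(
  split(student_tweets$text, student_tweets$category),
  function(texts) {
    dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))
    dtm <- dtm[slam::row_sums(dtm) > 0, ]
    LDA(dtm, k = 5, control = list(seed = 1234))
  }
)
```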

P.S. We plan on providing access to a GitHub repository that will host both the dataset we’ve worked with here and the code used to generate our different topic extraction models. This is in the works for the near term, but if you’re interested in getting access earlier, please drop me a note.