Introducing the discursus Project
Society is a continuously morphing entity. The output of countless dynamics between us, the social animals.
Some dynamics have little effect on the social tissue. Others give way to deep, far-reaching changes that can trigger revolutions.
Protest movements are such powerful dynamics. Behind the physical manifestation of protests are discourses that morph. They are incubators of ideas. Like viruses that mutate and that can spread and change our ethos.
The problem is that it’s hard to observe, study and understand those events. To extract clear demands. Protests are a chaotic mess of citizen requests, an effervescent debate on how we want to all live together.
But to hear those active citizens, make sense of their visions — that confronts us as a society and leads to healthier and more vibrant democracies.
I believe a healthy democracy needs to be confronted by socially driven protests. They force institutions to constantly rethink themselves. To not stay still, rigid and eventually corrupt.
I’ve been following protest events for a while now. Trying to get a sense of the discourses that shape them. Back when I was studying political science, I thought philosophers were the ones shaping society. But their ideas fuel the social magma, becoming manifest in the streets, through the exchange and confrontation of ideas.
The discursus project is an open source platform that mines, shapes and exposes the digital artifacts of protests, their discourses and the actors that influence social reforms.
The Evolution of discursus
I’ve been exploring the intersection of protest events and analytics for quite a while now.
Back in 2013, ahead of the football World Cup, Brazil erupted in protests over what was perceived as a misallocation of resources. While the state was spending lavish sums on building stadiums to impress the world, a sizeable portion of the population lived in poverty and couldn’t afford the rise in bus fares.
The 2014 Scottish Referendum was less about a specific protest movement than about tracking the conflicts between actors at a time when society was confronted by a divisive question.
Or here is a map of daily world protests in 2017, which I used to track protest events through multiple sources.
The initial intent was to get data-driven insights on political events. I was able to hack things together, but I never got a stable stack that served my needs, which is part of what drove the current discursus project.
But I’m getting ahead of myself. We need to talk about what has fed those protest analytics since the beginning, the GDELT data project.
The GDELT data source is a fascinating project founded and driven by Kalev H. Leetaru, who is a “Media Fellow at the RealClearFoundation and a Senior Fellow at the George Washington University Center for Cyber & Homeland Security and a member of its Counterterrorism and Intelligence Task Force”.
Here is a collection of the most interesting (and stunning) facts about this project:
- Monitoring of the world’s news media in over 100 languages
- An archive of events dating all the way back to January 1st, 1979
- Data is updated every 15 minutes
- Probably the largest realtime streaming news machine translation deployment in the world
- Extracts more than 300 categories of events, millions of themes and thousands of emotions from source articles
- Nearly 60 attributes are captured for each event, including the approximate location of the action and those involved.
There are other similar projects out there (the most common alternative is the DARPA-affiliated ICEWS dataset), but none has the scope of GDELT.
I don’t remember exactly how and when I stumbled upon GDELT, but that was the early days of my interest in analytics. I came from a background in political science and computer science and was slowly transitioning to a career in data analytics. GDELT kinda sealed the deal as it was all that excited me about data.
When I decided to leave my steady job around 2015 and try my hand at freelancing, one of the most active streams of work I got was helping extract and make sense of the GDELT dataset. But as I said above, the process was laborious each time, as I lacked the skills and experience in structuring data stacks.
That’s a problem I also felt as I became more and more involved in product analytics projects. I discovered dbt around that time and that led me down a path of discovery of what became known as analytics engineering.
A few years later, I still have a lot to learn about building high quality, stable, agile data stacks, but I now have a solid foundation, which drove me back to the problem I wanted to solve with GDELT: how to easily mine, enhance and consume protest data.
Data Journalism and OSINT
In parallel to becoming familiar with GDELT, I also remember how excited I was about data journalism back when Jer Thorp was the data artist for The New York Times. I had attended a keynote of his and I believed at that time that data journalism was going to free us from “subjective” coverage that was led by editorial stances.
Data journalism went on to produce a lot of innovations in how we build datasets around diverse domains, but most of all it helped tell compelling stories with that data. I remember the insightful interactive visualizations made with Processing and d3.js.
The domain of data analysis had broadened to include journalists and artists who really expanded the scope of what analytics could cover and which audiences were exposed to it.
Another fascinating expansion lately has been around the OSINT (Open Source Intelligence) methodology and especially the Bellingcat group. We are now dealing with collectives of citizens who are driven by uncovering facts about world events and getting the stories straight. In the case of Bellingcat, they’ve managed to expose quite a few cover-ups by state powers that thought they could commit criminal acts with impunity.
It’s those tracks of innovation that motivate me to open source this project to data journalists and citizen groups. This project’s mission is to simplify access to rich datasets such as GDELT and provide insights into world protests, their actors and discourses.
Building on Modern Data Stacks
This project is the result of me working on the development of modern data stacks for the past 5 years or so.
I won’t go into the details of what modern data stacks are as this is out of the scope of this article, but there are a few key attributes that the reader should be aware of:
- Modern data stacks are fully cloud-based
- They are modular stacks, in the sense that each component of the pipeline (ingestion, warehousing, transformation, BI) should be interchangeable with other providers.
- They adopt the best DevOps principles and practices to create an ecosystem of DataOps tools that make our stacks agile, version controlled, continuously integrated and deployed, and of high quality.
Obviously, there is quite a bit more to say about modern data stacks, but the point is that the discursus project has been built on top of those principles, practices and tools.
As of version 0.0.1, our stack looks like the following:
So what’s the status of this project at the time of this writing?
We’ve recently released a 0.0.1 version of the discursus_core project. This is the first stable release of the project and its main features are:
- A miner that sources events from the GDELT project and saves them to AWS S3.
- A dbt project that creates a data warehouse which exposes protest events.
- A Dagster orchestrator that schedules the mining and transformation pipelines.
Anyone can as of now clone this project and run an instance of it on their own server. Our current playground is a simple EC2 instance on AWS.
The Dagster orchestrator automatically launches the following 2 pipelines on a regular schedule:
- data_mining_pipeline - This is the miner that scrapes and stores GDELT data in AWS S3 every 15 minutes.
- transform_data_pipeline - This pipeline warehouses the S3 data in a Snowflake instance, runs dbt transformations and tests the output based on schema expectations.
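To make the mining step a bit more concrete, here is a minimal sketch of what one 15-minute run has to decide: which GDELT file to fetch and where to store it. It is not the code from discursus_core. It assumes GDELT publishes a small listing of its latest files with one "size checksum URL" entry per line, and the function names, the sample values and the S3 key layout are all hypothetical.

```python
from datetime import datetime

# Illustrative listing in the assumed "<size> <checksum> <url>" shape.
# Sizes and checksums below are placeholders, not real values.
SAMPLE_LISTING = """\
1000 checksum-placeholder-1 http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip
2000 checksum-placeholder-2 http://data.gdeltproject.org/gdeltv2/20150218230000.mentions.CSV.zip
3000 checksum-placeholder-3 http://data.gdeltproject.org/gdeltv2/20150218230000.gkg.csv.zip
"""


def latest_export_url(listing: str) -> str:
    """Return the URL of the events export file named in the listing."""
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].endswith(".export.CSV.zip"):
            return parts[2]
    raise ValueError("no events export found in listing")


def s3_key_for(url: str, prefix: str = "gdelt/events") -> str:
    """Build a date-partitioned object key (hypothetical layout) from the file name."""
    file_name = url.rsplit("/", 1)[-1]
    # The file name starts with a YYYYMMDDHHMMSS timestamp.
    stamp = datetime.strptime(file_name.split(".")[0], "%Y%m%d%H%M%S")
    return f"{prefix}/{stamp:%Y/%m/%d}/{file_name}"


url = latest_export_url(SAMPLE_LISTING)
key = s3_key_for(url)
print(url)
print(key)
```

In a real run the listing would be downloaded rather than hard-coded, and the resulting key would be handed to whatever S3 client the miner uses. Partitioning keys by date is just one possible choice that keeps later warehouse loads easy to scope.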
The current and sole fact table that populates the core warehouse is a wh_events_fact table. This provides access to a cleaned, constantly updated table with all the attributes of protest events around the world.
Here’s what the current ERD (entity relationship diagram) looks like:
With that data, you can map protest events and drill down into the individual protests that are happening in specific regions of the world, at specific times.
Or we can map the main actors that have been involved in protests in the past 7 days in the US for example.
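As a rough idea of what such a cut could look like, the sketch below counts the most active actors in one country over a trailing 7-day window. It uses an in-memory SQLite table as a stand-in for the Snowflake warehouse, and the column names (event_date, country_code, actor_name) and sample rows are assumptions made up for the example, not the actual wh_events_fact schema.

```python
import sqlite3

# In-memory stand-in for the warehouse; columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wh_events_fact (event_date TEXT, country_code TEXT, actor_name TEXT)"
)
conn.executemany(
    "INSERT INTO wh_events_fact VALUES (?, ?, ?)",
    [
        ("2021-06-01", "US", "STUDENT"),
        ("2021-06-02", "US", "STUDENT"),
        ("2021-06-03", "US", "TEACHER"),
        ("2021-05-01", "US", "TEACHER"),  # outside the 7-day window
        ("2021-06-02", "FR", "FARMER"),   # different country
    ],
)


def top_actors(conn, country_code, as_of, days=7):
    """Count events per actor for one country in the window ending at as_of."""
    query = """
        SELECT actor_name, COUNT(*) AS event_count
        FROM wh_events_fact
        WHERE country_code = ?
          AND event_date > date(?, ?)
          AND event_date <= ?
        GROUP BY actor_name
        ORDER BY event_count DESC, actor_name
    """
    params = (country_code, as_of, f"-{days} days", as_of)
    return conn.execute(query, params).fetchall()


result = top_actors(conn, "US", "2021-06-04")
print(result)
```

The same GROUP BY shape should carry over to the real warehouse once the actual column names are substituted, though date arithmetic syntax differs between SQLite and Snowflake.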
The point is that there are many ways to aggregate and slice that data. We want to give end users access to high quality data that can be consumed on its own or joined to the other sources they might be using.
What’s the Future
This is just the beginning and now that the foundations are in place, we want to expand the dataset.
There are immediate needs that we’ll expand on below. After-the-fact descriptive analytics is clearly the early goal of this project, but we want to go further than that.
We’d like to provide a real-time window into protest events as they unfold: a tool that extracts, in real time, how protest events relate to one another, what their main discourses are and which actors are behind them.
This is a long-term goal for sure, but it’s technically achievable. We hope to drive higher quality coverage of those events and help citizens influence institutional reforms.
Short Term Roadmap
As mentioned above, for now we want to improve the quality and scope of the data we are outputting through the discursus project. With that in mind, here are a few improvements on the short-term roadmap:
- The ERD above is quite slim at the moment and as a priority we want to: extract dimensions from the fact table; expand the number of fact and xa (extended aggregates) tables.
- As a second level of expansion, we want to enrich GDELT data by extracting topics from articles’ metadata.
- A third level of expansion is to start including new sources of data, such as Twitter, Wikipedia, etc. to provide richer contextual information to the protest events we mined from GDELT.
But the future is not only about expanding the dataset. I’d love for you to get involved as well. There are many ways to contribute:
- Visit the discursus_core project, star it, fork it, take ownership of some of the issues we already documented, and send over some PRs.
- Spin up your own clone of the project, build visualizations and analysis and spread the word about this project.
- Create new issues on how you’d like to see this project evolve.
- Just fork the project and do your own thing with it — credits are appreciated, but not required.
People take to the streets to be heard. But their ideas and requests are lost in a maelstrom of institutional discourses. Or they get manipulated by big interest groups.
We want to provide a window into the unfiltered, unbiased, real reforms claimed by citizens.
If that mission speaks to you, please have a look at the project’s GitHub repo and just reach out there or to me directly.