
Introducing the discursus Project

Hong Kong protests (source: The New York Times)

Society is a continuously morphing entity. The output of countless dynamics between us, the social animals.

Some dynamics have little effect on the social tissue. Others give way to deep, far-reaching changes that can trigger revolutions.

Protest movements are such powerful dynamics. Behind the physical manifestation of protests are discourses that morph. They are incubators of ideas. Like viruses that mutate and that can spread and change our ethos.

The problem is that it’s hard to observe, study and understand those events. To extract clear demands. Protests are a chaotic mess of citizen requests, an effervescent debate on how we want to all live together.

But hearing those active citizens and making sense of their visions confronts us as a society and leads to healthier, more vibrant democracies.

“We are the social network!” — Slogan from the 2013 Brazilian protests

I believe a healthy democracy needs to be confronted by socially driven protests. They force institutions to constantly rethink themselves. To not stay still, rigid and eventually corrupt.

I’ve been following protest events for a while now. Trying to get a sense of the discourses that shape them. Back when I was studying political science, I thought philosophers were the ones shaping society. But their ideas fuel the social magma, becoming manifest in the streets, through the exchange and confrontation of ideas.

The discursus project is an open source platform that mines, shapes and exposes the digital artifacts of protests, their discourses and the actors that influence social reforms.

Map of World Citizen Protests in Last 7 Days (on June 15, 2021)

The Evolution of discursus

I’ve been exploring the intersection of protest events and analytics for quite a while now.

Back in 2013, ahead of the football World Cup, Brazil erupted in protests over what was perceived as a misallocation of resources. While the state was spending lavish sums on building stadiums to impress the world, a sizeable portion of the population lived in poverty and couldn’t afford the rise in bus fares.

Brazil protests in 2013

The 2014 Scotland referendum was less about a specific protest movement than about tracking the conflicts between actors at a time when society was confronted by a divisive question.

Scotland referendum of 2014

Or here is a map of daily world protests in 2017, which I used to track protest events through multiple sources.

Daily dashboard of protests in 2017

The initial intent was to get data-driven insights on political events. I was able to hack things together, but I never got a stable stack that served my needs. That is part of what drove the current discursus project.

But I’m getting ahead of myself. We need to talk about what has fed those protest analytics since the beginning, the GDELT data project.

GDELT Inside

The GDELT data source is a fascinating project founded and driven by Kalev H. Leetaru, who is a “Media Fellow at the RealClearFoundation and a Senior Fellow at the George Washington University Center for Cyber & Homeland Security and a member of its Counterterrorism and Intelligence Task Force”.

Here is a collection of the most interesting (and stunning) facts about this project:

There are other similar projects out there (the most common alternative is the DARPA-affiliated ICEWS dataset), but none has the scope of GDELT.

I don’t remember exactly how and when I stumbled upon GDELT, but that was the early days of my interest in analytics. I came from a background in political science and computer science and was slowly transitioning to a career in data analytics. GDELT kinda sealed the deal as it was all that excited me about data.

When I decided to leave my steady job around 2015 and try my hand at freelancing, one of the most active streams of work I got was helping extract and make sense of the GDELT dataset. But as I said above, the process was laborious each time, as I lacked the skills and experience in structuring data stacks.

That’s a problem I also felt as I became more and more involved in product analytics projects. I discovered dbt around that time and that led me down a path of discovery of what became known as analytics engineering.

A few years later, I still have quite a lot to learn about building high-quality, stable, agile data stacks, but I now have a solid foundation, which drove me back to the problem I wanted to solve with GDELT: how to easily mine, enhance and consume protest data.

Data Journalism and OSINT

In parallel to becoming familiar with GDELT, I also remember how excited I was about data journalism back when Jer Thorp was the data artist at The New York Times. I had attended a keynote of his, and I believed at the time that data journalism was going to free us from “subjective” coverage led by editorial stances.

Data journalism went on to produce a lot of innovations in how we build datasets around diverse domains, but most of all it helped tell compelling stories with that data. I remember the insightful interactive visualizations made with Processing and d3.js.

The domain of data analysis had broadened to include journalists and artists, who really expanded the scope of what analytics could cover and which audiences were exposed to it.

Another fascinating expansion lately has been around the OSINT (Open Source Intelligence) methodology, and especially the Bellingcat group. We are now dealing with collectives of citizens who are driven to uncover facts about world events and get the stories straight. In the case of Bellingcat, they’ve managed to expose quite a few cover-ups by state powers that thought they could commit criminal acts with impunity.

It’s those tracks of innovation that motivate me to open source this project to data journalists and citizen groups. This project’s mission is to simplify access to rich datasets such as GDELT and provide insights into world protests, their actors and discourses.

Building on Modern Data Stacks

This project is the result of me working on the development of modern data stacks for the past 5 years or so.

Modern data stacks (credit: dbt)

I won’t go into the details of what modern data stacks are as this is out of the scope of this article, but there are a few key attributes that the reader should be aware of:

Obviously, there is much more to say about modern data stacks, but the point is that the discursus project has been built on top of those principles, practices and tools.

As of version 0.0.1, our stack looks like the following:

discursus 0.0.1 architecture

Current Status

So what’s the status of this project at the time of this writing?

We’ve recently released a 0.0.1 version of the discursus_core project. This is the first stable release of the project and its main features are:

Anyone can now clone this project and run an instance of it on their own server. Our current playground is a simple EC2 instance on AWS.

The Dagster orchestrator automatically launches the following 2 pipelines on a regular schedule:

  1. data_mining_pipeline - This is the miner that scrapes GDELT data and stores it in AWS S3 every 15 minutes.
  2. transform_data_pipeline - This pipeline warehouses the S3 data in a Snowflake instance, runs dbt transformations and tests the output against schema expectations.
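
To make the 15-minute mining cadence concrete, here is a minimal sketch of how a miner could map "now" to the time slot it should process and to a storage key for that slot. This is not the discursus_core implementation: the function names, the `gdelt/events` prefix and the date-partitioned key layout are all assumptions made for illustration.

```python
from datetime import datetime, timezone

SLOT_MINUTES = 15  # mining cadence stated in the post


def latest_slot(now: datetime) -> datetime:
    """Round a timestamp down to the most recent 15-minute boundary."""
    floored_minute = now.minute - now.minute % SLOT_MINUTES
    return now.replace(minute=floored_minute, second=0, microsecond=0)


def object_key(slot: datetime, prefix: str = "gdelt/events") -> str:
    """Build a date-partitioned storage key (hypothetical layout)."""
    return f"{prefix}/{slot:%Y/%m/%d}/{slot:%Y%m%d%H%M%S}.csv"


if __name__ == "__main__":
    # Print the key a run started right now would write to.
    slot = latest_slot(datetime.now(timezone.utc))
    print(object_key(slot))
```

Deriving the key from the slot rather than from the run time means that a re-run of the same slot would overwrite the same object instead of creating a duplicate, which keeps the downstream warehouse load simple.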

Dagster orchestrating the stack’s pipelines

The current and sole fact table that populates the core warehouse is a wh_events_fact table. This provides access to a cleaned, constantly updated table with all the attributes of protest events around the world.
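
The transform pipeline tests its output against schema expectations. As a rough illustration of what such expectations amount to for a fact table, here is a plain-Python sketch of two common checks, "not null" and "unique", which mirror the generic tests of the same names in dbt. The column names (`event_id`, `event_date`) and the sample rows are hypothetical and are not taken from the actual wh_events_fact schema.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Return the rows where the given column is missing or None."""
    return [row for row in rows if row.get(column) is None]


def duplicate_values(rows: List[Row], column: str) -> Set[Any]:
    """Return the values that appear more than once in the given column."""
    seen: Set[Any] = set()
    dupes: Set[Any] = set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


if __name__ == "__main__":
    # Made-up rows with hypothetical columns, for illustration only.
    sample = [
        {"event_id": 1, "event_date": "2021-06-14"},
        {"event_id": 2, "event_date": None},
        {"event_id": 2, "event_date": "2021-06-15"},
    ]
    print(not_null_failures(sample, "event_date"))
    print(duplicate_values(sample, "event_id"))
```

In the real stack these checks would be declared alongside the dbt models rather than hand-written, but the logic being asserted is the same.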

Here’s what the current ERD (entity relationship diagram) looks like:

discursus_core ERD

With that data, you can map protest events and drill down into the individual protests that are happening in specific regions of the world, at specific times.

Map of US Citizen Protests in Last 7 Days (on June 15, 2021)

Or we can map the main actors that have been involved in protests in the past 7 days in the US for example.

US Citizen Protest’s Main Actors in Last 7 Days (on June 15, 2021)

The point is that there are many ways to aggregate and slice that data. We want to give end users access to high quality data that can be consumed on its own or joined to the other sources you might be using.
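
As a sketch of the two cuts described above (events per country and main actors within a time window), the queries below run against an in-memory SQLite table. Only the table name wh_events_fact comes from the project; the columns (`event_id`, `event_date`, `country`, `actor`) and the placeholder rows are assumptions for illustration, and the real warehouse lives in Snowflake rather than SQLite.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with hypothetical columns and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wh_events_fact ("
        "event_id INTEGER PRIMARY KEY, event_date TEXT, country TEXT, actor TEXT)"
    )
    rows = [
        (1, "2021-06-09", "country_a", "actor_x"),
        (2, "2021-06-10", "country_a", "actor_y"),
        (3, "2021-06-12", "country_a", "actor_x"),
        (4, "2021-06-13", "country_b", "actor_z"),
        (5, "2021-05-01", "country_b", "actor_z"),  # outside the window below
    ]
    conn.executemany("INSERT INTO wh_events_fact VALUES (?, ?, ?, ?)", rows)
    return conn


def events_by_country(conn: sqlite3.Connection, since: str) -> list:
    """Count events per country on or after the given ISO date."""
    query = (
        "SELECT country, COUNT(*) AS n FROM wh_events_fact "
        "WHERE event_date >= ? GROUP BY country ORDER BY n DESC, country"
    )
    return conn.execute(query, (since,)).fetchall()


def top_actors(conn: sqlite3.Connection, country: str, since: str, limit: int = 5) -> list:
    """Rank actors by number of events in one country on or after the given date."""
    query = (
        "SELECT actor, COUNT(*) AS n FROM wh_events_fact "
        "WHERE country = ? AND event_date >= ? "
        "GROUP BY actor ORDER BY n DESC, actor LIMIT ?"
    )
    return conn.execute(query, (country, since, limit)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(events_by_country(db, "2021-06-08"))
    print(top_actors(db, "country_a", "2021-06-08"))
```

The same two queries, pointed at the real fact table, are the kind of aggregation that would sit behind a map of events and a ranking of main actors.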

What’s the Future

This is just the beginning and now that the foundations are in place, we want to expand the dataset.

There are immediate needs that we’ll expand on below, but we want to emphasize that after-the-fact descriptive analytics is only the early goal of this project. We want to go further than that.

We’d like to provide a real-time window into protest events as they unfold: a tool that extracts, in real time, how protest events relate to one another, what their main discourses are and which actors are behind them.

This is a long-term goal for sure, but it’s technically achievable. We hope to drive higher quality coverage of those events and help citizens influence institutional reforms.

Short Term Roadmap

As mentioned above, for now we want to improve the quality and scope of the data we are outputting through the discursus project. With that in mind, here are a few improvements on the short-term roadmap:

But the future is not only about expanding the dataset. I’d love for you to get involved as well. There are many ways to contribute:

People take to the streets to be heard. But their ideas and requests are lost in a maelstrom of institutional discourses. Or they get manipulated by big interest groups.

We want to provide a window into the unfiltered, unbiased, real reforms claimed by citizens.

If that mission speaks to you, please have a look at the project’s GitHub repo and reach out there or to me directly.