Machine Learning Enrichment in your Data Asset Production Flow

Discursus 0.0.3 is out and we’re excited about the new machine learning (ML) layer we’ve added. It opens up a world of possibilities and we want to share how we tackled that challenge and what it means for the discursus project.

This release uses Dagster, dbt and Novacene AI to enrich data as part of producing data assets. Together they give us a versatile, stable and efficient way to automate that production.

Architecture

To set the context, here’s a diagram of the data asset production flow for discursus.

discursus core data platform

Couple of points to note:

This flow is supported by the following architecture.

discursus core architecture

Its components are:

Let’s explore further the Machine Learning as a Service (MLaaS) layer we’re introducing in this project.

MLaaS — Training, hosting and using ML models

Full disclosure: Novacene AI is a supporter and a friend of the discursus project. Theirs is a high-quality service, and the team is knowledgeable and always ready to help, so it was an obvious choice for me to use this MLaaS platform in the discursus project. Of course, just as S3, dbt, Dagster and Snowflake could all be replaced with equivalent tools, other suitable MLaaS platforms would likely work here too.

That said, we’ve seen how Novacene fits into our data asset production flow and our architecture. To frame how we work with this service, here’s a diagram of how discursus and Novacene AI interact.

Interaction between discursus and Novacene AI

Let’s briefly go over each of those steps:

Novacene is ideal for the discursus project: we can easily train and test models with existing or custom algorithms, host our ML models, send data for enrichment and then retrieve the results. We do all this through their very flexible API.
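To make the "send data for enrichment and then retrieve the results" step concrete, here is a minimal sketch of a submit-then-poll loop. It assumes enrichment jobs are asynchronous, which this post does not confirm. The `EnrichmentClient` interface, its method names, the `FakeClient` stand-in and the `predicted_class` field are all hypothetical and are not Novacene’s actual API.

```python
import time
from dataclasses import dataclass, field
from typing import Protocol


class EnrichmentClient(Protocol):
    """Hypothetical MLaaS interface (an assumption, not Novacene's real API)."""

    def submit(self, model_id: str, records: list[dict]) -> str: ...
    def status(self, job_id: str) -> str: ...
    def results(self, job_id: str) -> list[dict]: ...


@dataclass
class FakeClient:
    """In-memory stand-in so the sketch runs without any network access."""

    polls_until_done: int = 2
    _jobs: dict = field(default_factory=dict)

    def submit(self, model_id, records):
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"records": records, "polls": 0}
        return job_id

    def status(self, job_id):
        job = self._jobs[job_id]
        job["polls"] += 1
        return "done" if job["polls"] >= self.polls_until_done else "running"

    def results(self, job_id):
        # Toy "model": an article is relevant if its title mentions a protest.
        enriched = []
        for record in self._jobs[job_id]["records"]:
            is_relevant = "protest" in record["title"].lower()
            label = "relevant" if is_relevant else "not_relevant"
            enriched.append({**record, "predicted_class": label})
        return enriched


def enrich(client, model_id, records, poll_seconds=0.0, max_polls=10):
    """Send records to a hosted model, wait for the job, return the predictions."""
    job_id = client.submit(model_id, records)
    for _ in range(max_polls):
        if client.status(job_id) == "done":
            return client.results(job_id)
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


events = [
    {"title": "Students protest tuition hikes"},
    {"title": "Quarterly earnings report"},
]
enriched_events = enrich(FakeClient(), "protest-relevancy", events)
```

Keeping the client behind a small interface like this means the pipeline code does not change if the MLaaS provider is swapped for an equivalent one.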

As a side note, if you’re considering working with Novacene and need a Dagster resource to interact with it, have a look at the one we’ve put together for discursus.

ML Training Engine

The ML Training Engine is our internal process for manually classifying streaming events so that we can continuously improve our ML models. Because we use supervised algorithms, we needed a way to bring a human into the loop.

discursus core data platform

Essentially, as we mine and enrich new data, we send a sample to an Airtable base, which provides a very convenient interface to work with. That sample includes the event’s article metadata, as well as our model’s classification prediction. We can then tag those classifications as accurate or inaccurate.

discursus ML training engine — Interface

Over time, this gives us a larger dataset of better manually classified examples with which to retrain our ML model.
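The review loop described in this section can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project’s actual code: the field names (`url`, `title`, `predicted_class`, `is_accurate`) and all three function names are hypothetical, and the step that pushes rows to Airtable is left out.

```python
import random


def sample_for_review(enriched_events, sample_size, seed=None):
    """Pick a random subset of enriched events to send to human reviewers."""
    rng = random.Random(seed)
    k = min(sample_size, len(enriched_events))
    return [
        {
            "url": event["url"],
            "title": event["title"],
            "predicted_class": event["predicted_class"],
            "is_accurate": None,  # filled in later by a reviewer
        }
        for event in rng.sample(enriched_events, k)
    ]


def prediction_accuracy(reviewed_rows):
    """Share of reviewed predictions tagged accurate; None if nothing is tagged."""
    tagged = [row for row in reviewed_rows if row["is_accurate"] is not None]
    if not tagged:
        return None
    return sum(row["is_accurate"] for row in tagged) / len(tagged)


def confirmed_examples(reviewed_rows):
    """Predictions confirmed by a reviewer become labeled training examples."""
    return [
        {"title": row["title"], "label": row["predicted_class"]}
        for row in reviewed_rows
        if row["is_accurate"] is True
    ]
```

Rows tagged inaccurate are deliberately left out of `confirmed_examples`: they would need a corrected label from the reviewer before they could be used for retraining.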

What it means for discursus

As a reminder, the discursus project is an open-source data platform that mines, shapes and exposes the digital artifacts of protests, their discourses and the actors that influence social reforms. In practice, that means we aim to expose the entities (events, actors and narratives) that are encapsulated in protest movements, as well as their relationships.

We want to create a data platform that will allow analysts to explore, visualise and convey the story arcs of protest movements. That means an analyst should be able to explore the dynamics between entities, the timeline of those dynamics, as well as the narratives that trigger a movement, sustain it and eventually lead to its end.

We’re still a long way from being able to tell data-rich stories about protest movements. But we hope machine learning will help us improve the quality of the data that defines those entities, their relationships and how they morph through time.