
discursus Core - The Final Semantic Layer

Version 0.1 of the open source discursus Core project

 

Version 0.1 refines and implements the discursus semantic layer, which represents the skeleton of the protest movements we want to abstract. This completes the foundational implementation of our core entities.

With a comprehensive semantic layer now implemented, here’s an updated design of our data platform.

discursus data assets

Monitoring Protest Events with Data

First off, before we get into the details, why are we building this? The mission of discursus is to provide data-driven insights into protest movements, giving us a more comprehensive and objective view of which events are associated with a movement, their actors, and their narratives.

I’m really excited to introduce a new protest monitoring dashboard that provides a high-level view of all protest movements happening around the world, in near real-time.

discursus protest monitoring dashboard

I often say that I’m the #1 user of discursus, and this new interactive environment makes exploring the platform’s data even more fun. I hope you’ll enjoy it as well.

About the semantic layer

The semantic layer is the skeleton of the phenomenon / domain we want to provide an abstraction for. It has entities and relationships, and those have attributes that change over time. So what’s the full abstraction we’re trying to build for discursus, and what does the ERD look like after version 0.1?

The first image is an abstraction that would roughly represent the domain we’re trying to map with discursus. The top layer is the protest movement phenomenon itself, whereas the bottom layer is how that phenomenon is being reported.

Abstraction for protest movements

We don’t have access to that top layer directly; we can only use the observers as a proxy for what really happened. That’s where the whole challenge lies: how can we reconstruct the actual phenomenon using the observer artifacts as our raw material? It also means we’ll eventually need to take observer biases into account.

The second image is what the data warehouse entities look like and how they relate to each other.

discursus platform’s Entity Relationship Diagram

A few notes:

Implementing the Semantic Layer with Droughty

For me, the semantic layer boils down to documenting a domain’s entities, attributes, metrics, and relationships. Droughty (from Lewis Baker at Rittman Analytics) helps me do that easily and efficiently.

My development workflow now looks like the following:

I get an up-to-date dbml definition by running droughty dbml.

discursus dbml definition generated by Droughty

I get my Cube definitions by running droughty cube.

discursus Cube definitions generated by Droughty

You can also use droughty to automatically generate LookML and dbt tests. It’s just a super useful tool to add to your stack.

Other Improvements

Protest Grouping

A new entity that is now part of the semantic layer is protests_dim. This represents the protest movements in our abstraction graphic.

We currently query protest events manually, using criteria such as the country where an event occurred, the timeframe, keywords in article descriptions, etc. Until we introduce machine learning to group events together, we want a configuration engine that associates events with protest configs.
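As a rough illustration of the idea (the field and config names here are hypothetical, not the actual discursus schema), such a configuration engine boils down to matching each event against a list of config rows:

```python
from datetime import date

# Hypothetical protest configs: each entry groups events by simple criteria
# (country, time window, keywords in the article description).
PROTEST_CONFIGS = [
    {
        "protest_name": "Example Movement",
        "countries": {"CA", "US"},
        "start_date": date(2022, 1, 1),
        "end_date": date(2022, 3, 31),
        "keywords": {"strike", "march"},
    },
]

def match_event(event, configs=PROTEST_CONFIGS):
    """Return the names of the protest configs an event belongs to."""
    matches = []
    for cfg in configs:
        in_country = event["country"] in cfg["countries"]
        in_window = cfg["start_date"] <= event["date"] <= cfg["end_date"]
        has_keyword = any(kw in event["description"].lower() for kw in cfg["keywords"])
        if in_country and in_window and has_keyword:
            matches.append(cfg["protest_name"])
    return matches

event = {
    "country": "CA",
    "date": date(2022, 2, 1),
    "description": "Workers march downtown demanding wage increases",
}
print(match_event(event))  # ['Example Movement']
```

In the actual platform this matching happens in the warehouse rather than in Python, but the logic is the same: config rows in, event-to-movement associations out.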

And this is where we introduced those new components to our stack.

discursus architecture

From a Google Sheet, we manually configure how we want to group protest events together. Those configurations are then sourced with Airbyte directly into Snowflake, and used to build our protests_dim entity.

protests_dim processing flow

Now that we have those groupings, we can use them to select a specific protest movement to analyse in our monitoring dashboard.

Selection of protest movement in our monitoring dashboard

Data Assets

In our data products diagram above, we have “data asset” boxes, which are the endpoints of each data product. Those are not abstract concepts, but real objects in Dagster, useful for tracking attributes as well as triggering dependent transformations for other data products.

An improvement we’ve made in this release is to materialize all those endpoints as data assets. We can now track the performance of their computation as well as see how their attributes change over time.

List of data assets materialized in Dagster

Attributes of a data asset as seen in Dagster

Performance Improvements

One thing we’d meant to do for a while was to start using dbt incremental tables for our largest models. But we needed a way to control whether jobs were run as an incremental run or a full-refresh run.

Using Dagster configs, we can now control how we want to run dbt ops, either from the Launchpad…

Configuring full-refresh runs in Dagster
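For reference, the run config entered in the Launchpad looks something like the following sketch (the op names and the full_refresh_flag key follow the schedule example below):

```yaml
ops:
  build_dw_staging_layer:
    config:
      full_refresh_flag: true
  build_dw_integration_layer:
    config:
      full_refresh_flag: true
  build_dw_warehouse_layer:
    config:
      full_refresh_flag: true
```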

Or from schedules:

@schedule(job=build_data_warehouse, cron_schedule="15 3,9,15,21 * * *")
def build_data_warehouse_schedule(context: ScheduleEvaluationContext):
    return RunRequest(
        run_key=None,
        run_config={
            # A Python dict can't repeat the "ops" key, so all three
            # op configs must live under a single "ops" entry.
            "ops": {
                "build_dw_staging_layer": {"config": {"full_refresh_flag": False}},
                "build_dw_integration_layer": {"config": {"full_refresh_flag": False}},
                "build_dw_warehouse_layer": {"config": {"full_refresh_flag": False}},
            }
        },
    )

As we can see in our Core Entities data warehouse job, our dbt runs now set the --full-refresh argument when full_refresh_flag is set to True.
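The flag-to-argument mapping can be sketched as a small helper (illustrative only: the helper name and the model selection are hypothetical, and the actual discursus ops invoke dbt through Dagster's dbt integration):

```python
def dbt_run_args(full_refresh_flag: bool, models: str = "marts") -> list:
    """Build a dbt CLI invocation, appending --full-refresh only
    when the op's config flag is set."""
    args = ["dbt", "run", "--select", models]
    if full_refresh_flag:
        args.append("--full-refresh")
    return args

print(dbt_run_args(False))  # ['dbt', 'run', '--select', 'marts']
print(dbt_run_args(True))   # ['dbt', 'run', '--select', 'marts', '--full-refresh']
```

This keeps the incremental run as the cheap default (the scheduled runs above pass False) while a full rebuild stays one config change away.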