
Using dbt

Overview

If you are a Snowplow BDP customer, you can get started with configuring and deploying dbt projects as outlined in the steps below. For more information about setting up your dbt project, see the Snowplow dbt docs.

As an initial overview, in your snowplow-pipeline repository your data models reside in the dbt directory. To start with, your GitHub repository will look like this (you may have additional folders depending on your project, e.g. dbt_packages or docs):

.
├── datamodeling
|   ├── datamodel_name
|   |   └── dbt
|   |       ├── analyses
|   |       ├── logs
|   |       ├── macros
|   |       ├── models
|   |       |   └── my_custom_datamodel
|   |       |       ├── model_name_1.sql
|   |       |       └── model_name_2.sql
|   |       ├── seeds
|   |       ├── snapshots
|   |       ├── tests
|   |       ├── README.md
|   |       ├── dbt_project.yml
|   |       └── profiles.yml

Inside the datamodeling directory you can create the data models as subdirectories, giving them an appropriate name. The allowable characters for naming are a-z, 0-9, and -. Under each data model subdirectory you must create a subdirectory named dbt.

Each dbt subdirectory contains your dbt project, with the dbt_project.yml file at the top level of that folder. For further information about the structure of a dbt project, see the dbt docs.
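As an illustration only, a minimal dbt_project.yml for the layout above might look like the sketch below. The project name, profile name, and materialization are placeholders; use your own values and make sure the profile matches the one defined in your profiles.yml.

name: 'my_custom_datamodel'
version: '1.0.0'
config-version: 2

# Must match the profile name defined in profiles.yml (placeholder)
profile: 'profile_name'

model-paths: ["models"]

models:
  my_custom_datamodel:
    # Placeholder materialization; choose what suits your model
    +materialized: table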

info

When the schedule kicks off, the data model configuration is loaded and validated, and a corresponding data modeling DAG is autogenerated. This DAG contains only two steps: a cached git clone and a dbt run command; dbt manages all model dependencies within the run command.

Read below for more details on the steps to configure and run your dbt data models with Snowplow BDP.

1. Set up your dbt profile

You need to provide your prod connection profile for the warehouse you are connecting to in the profiles.yml file for each data model. Ensure that your profile's output and target are set to prod. See the dbt adapters docs for more specific configuration information for each database.

See dbt connection setup for more information.

profile_name:
  outputs:
    prod:
      type: redshift
      threads: [1 or more]
      host: [hostname.region.redshift.amazonaws.com]
      port: 5439
      user: datamodeling
      pass: "{{ env_var('REDSHIFT_DATAMODELING_PASSWORD') }}"
      dbname: [database name]
      schema: [schema name]
  target: prod
info

The warehouse password should be sent via secure form in the Snowplow BDP Console so that the environment variables can be set.
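The example above is for Redshift; profiles for other warehouses follow the same pattern with adapter-specific fields. As a rough sketch only, a Snowflake profile could look like the following, where the bracketed values and the SNOWFLAKE_DATAMODELING_PASSWORD environment variable name are illustrative placeholders and the password would again be set via the secure form:

profile_name:
  outputs:
    prod:
      type: snowflake
      threads: [1 or more]
      account: [account identifier]
      user: datamodeling
      # Placeholder environment variable name
      password: "{{ env_var('SNOWFLAKE_DATAMODELING_PASSWORD') }}"
      role: [role name]
      database: [database name]
      warehouse: [warehouse name]
      schema: [schema name]
  target: prod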

2. The data modeling configuration

Data models can be configured via the Snowplow BDP Console:

In the first step, you can provide the data model name, a description, and the owner(s) of the data model, who will be alerted in case of failure. Finally, select the engine and version that should run your data model (dbt in this case):

Please note that the model name needs to match the corresponding datamodeling subfolder name in GitHub. The dbt project for the model in the above example would therefore live under:

your snowplow-pipeline repo > datamodeling > data model name > dbt ...

In the second step, you can then add the schedule:

Please note that the cron schedule is in UTC.
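For example, a cron expression of 0 2 * * * would run the data model every day at 02:00 UTC.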

caution

You should write your dbt models in such a way that, if they fail part way through, they can be re-run without issue. All Snowplow models are written in this way, and if you write your models following the custom modules documentation you can take advantage of this.

In the third and last step, you can configure the arguments and variables your dbt run call is executed with. The arguments (--selector, or --select / --exclude) follow the syntax described in the dbt docs, and so do the variables (the --vars flag):
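If you use the --selector argument, the selector it refers to is defined in a selectors.yml file at the top level of your dbt project. As an illustrative sketch (the selector name and path below are hypothetical), such a file could look like:

selectors:
  - name: my_custom_models
    description: All models under the my_custom_datamodel folder (hypothetical example)
    definition:
      method: path
      value: models/my_custom_datamodel

You would then pass --selector my_custom_models as the argument in the configuration above.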

Once you are happy with the configuration, you can create the model. The model will be disabled until you enable it:

Please make sure all your dbt project files are merged to the default branch in GitHub before enabling the model. Any changes merged to the default branch are available immediately.

3. Monitor your model in the Snowplow BDP Console

Everything is now set up, and you can monitor your data models running against your data warehouse from the Jobs UI in the Snowplow BDP Console. There you can see the generated data modeling DAG and monitor the status, duration, and run times at both the data model and the playbook level.
