Creating custom models
On this page, `<package>` can be one of: `web`, `mobile`, `ecommerce`, `normalize`.
Custom models
The Snowplow packages are designed to be easily customized or extended within your own dbt project by building your own custom models. The standard models we provide (base, page/screen views, sessions and users) are not designed to be modified by you. An example dbt project with custom models can be seen in the custom example directory of the snowplow-web repo and snowplow-mobile repo.
In general there are 3 types of custom models you may end up building:
- Full refresh/view models: these models build from the derived tables but require a full refresh each time they are built. This could include things like daily aggregations or rolling views.
- Incremental models: these models build from the existing (or custom) `_this_run` table(s) and are incrementally upserted to each run. This could include things like a custom sessions table with additional user information. These are the most common type of custom model.
- This run models: these models build from the `base_events_this_run` table that contains all columns for events being (re)processed in the run. These can be used to feed custom incremental models.
Guidelines & Best Practice
The standard models carry out the heavy lifting in establishing an incremental structure and providing the core logic for the most common aggregation use cases. This structure also allows custom models to be plugged in without impeding the maintenance of the standard models.
The following best practices should be followed to ensure that updates and bug fixes to the model can be rolled out with minimal complication:
- Custom models should not modify any of the tables generated by the Snowplow packages i.e. the scratch, derived or manifest tables.
- Customizations should not modify the SQL provided by the package - they should only consist of a new set of SQL statements, which produce a separate table.
- The logic for custom SQL should be idempotent and restart-safe - in other words, it should be written in such a way that a failure mid-way, or a re-run of the model, will not change the deterministic output.
- Redshift/Postgres users: be careful about joining context or self-describing event tables onto the main events table. Deduplication logic needs to be applied to avoid many-to-many joins. We have provided a macro, `get_sde_or_context()`, for this purpose from v0.14.0 of the snowplow-utils package. Check out the duplicates section and the package documentation on how to use it (a hedged sketch follows this list).
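To illustrate the shape of this, here is a minimal sketch only: the schema, table, timestamp limits, prefix and the deduplicated key columns used in the join are all placeholders for illustration, and the exact argument order and output columns of `get_sde_or_context()` should be checked against the snowplow-utils documentation for your version.

```sql
{# Hypothetical Redshift/Postgres model joining a custom context onto the events table.      #}
{# Arguments: source schema, context table, lower/upper timestamp limits, alias prefix - all #}
{# of these values are placeholders; check the snowplow-utils docs for the exact signature.  #}
with {{ snowplow_utils.get_sde_or_context('atomic', 'my_custom_context_1', "'2023-01-01'", "'2023-02-01'", 'my_ctx') }}

select
  ev.event_id,
  ctx.*
from {{ ref('snowplow_<package>_base_events_this_run') }} ev
-- The macro dedupes the context table; the join key column names below are assumptions
-- based on the prefix argument - verify them against the macro's generated SQL.
left join my_custom_context_1 ctx
  on ctx.my_ctx__id = ev.event_id
  and ctx.my_ctx__tstamp = ev.collector_tstamp
```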
In short, the standard models can be treated as the source code for a distinct piece of software, and custom models can be treated as self-maintained, additive plugins - in much the same way as a Java package may permit one to leverage public classes in their own API, and provide an entry point for custom programs to run, but will not permit one to modify the original API.
The `_this_run` and derived tables are considered part of the 'public' class of tables in this model structure, and so we can give assurances that non-breaking releases won't alter them. The other tables may be used in custom SQL, but their logic and structure may change from release to release, or they may be removed. If one does use a scratch table in custom logic, any breaking changes can be mitigated by either amending the custom logic to suit, or copying the relevant steps from an old version of the model into the custom model (however, this will rarely be necessary).
Creating a custom model
Full refresh/view model
These models build directly off the derived tables and can be built the same way you would any other dbt model, using either `table` or `view` as the materialization and referencing the derived table using the `{{ ref(...) }}` syntax. These models should NOT be tagged with the `snowplow_<package>_incremental` tag.
Note that because these models are not tagged, they will not be run when using the `snowplow_<package>` selector.
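As a hedged illustration only - assuming the web package, its `snowplow_web_page_views` derived table and the columns that table exposes - a simple daily rollup built as a view might look like this (the model name and aggregation are placeholders):

```sql
{# /models/snowplow_<package>_custom_models/my_daily_page_views.sql (hypothetical model) #}
{{
  config(
    materialized='view'
  )
}}

-- Aggregate the derived page views table into a daily rollup.
-- Swap in the derived table (and its columns) for the package you are using.
select
  cast(start_tstamp as date) as view_date,
  count(*) as page_views,
  count(distinct domain_sessionid) as sessions
from {{ ref('snowplow_web_page_views') }}
group by 1
```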
Incremental models
As these models form part of the incremental processing of the package, they should be tagged with `snowplow_<package>_incremental` in order to leverage the incremental logic of the package. We recommend creating a subdirectory of your `/models` directory to contain all your custom models. In this example we created the subdirectory `snowplow_<package>_custom_models`. We can then apply the tag to all models in this directory:
```yml
# dbt_project.yml
models:
  your_dbt_project:
    snowplow_<package>_custom_models:
      +tags: snowplow_<package>_incremental # Adds the tag to all models in the 'snowplow_<package>_custom_models' directory
```
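Alternatively - a sketch using standard dbt config rather than anything specific to the Snowplow packages - the tag can be applied directly in an individual model's config block:

```sql
{# /models/snowplow_<package>_custom_models/my_custom_model.sql #}
{# Equivalent to the directory-level +tags entry above #}
{{
  config(
    tags=['snowplow_<package>_incremental']
  )
}}
```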
These models should also make use of the optimized incremental materialization by setting `materialized='incremental'` and `snowplow_optimize=true` in your model config. Finally, as well as referencing a `_this_run` table, these models should make use of the `is_run_with_new_events` macro to only process the table when new events are available in the current run. This macro, `snowplow_utils.is_run_with_new_events(package_name)`, will evaluate whether the particular model, i.e. `{{ this }}`, has already processed the events in the given run. It returns a boolean and effectively blocks the upsert to incremental models if the run only contains old data. This protects against your derived incremental tables being temporarily updated with incomplete data during batched back-fills of other models.
```sql
{# /models/snowplow_<package>_custom_models/my_custom_model.sql #}
{{
  config(
    materialized='incremental',
    upsert_date_key='my_date_field',
    unique_key='unique_id_field',
    snowplow_optimize=true,
    ...
  )
}}

select
  ...
from {{ ref('snowplow_<package>_<table>_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_<package>') }} -- returns false if run doesn't contain new events
```
By doing all of this you ensure your table will be managed correctly in the manifest tables and make use of the incremental logic of the package. Listed below are the recommended tables to reference as your input for a custom model, depending on the level of aggregation required:
- Snowplow Web:
  - Event level: `snowplow_web_base_events_this_run`
  - Page view level: `snowplow_web_page_views_this_run`
  - Session level: `snowplow_web_sessions_this_run`
  - User level: `snowplow_web_users_this_run`
- Snowplow Mobile:
  - Event level: `snowplow_mobile_base_events_this_run`
  - Screen view level: `snowplow_mobile_screen_views_this_run`
  - Session level: `snowplow_mobile_sessions_this_run`
  - User level: `snowplow_mobile_users_this_run`
- Snowplow Normalize:
  - Event level: `snowplow_normalize_base_events_this_run`
- Snowplow E-commerce:
  - Event level: `snowplow_ecommerce_base_events_this_run`
  - Cart level: `snowplow_ecommerce_cart_interactions_this_run`
  - Checkout level: `snowplow_ecommerce_checkout_interactions_this_run`
  - Product level: `snowplow_ecommerce_product_interactions_this_run`
  - Session level: `snowplow_ecommerce_sessions_this_run`
  - Transaction level: `snowplow_ecommerce_transaction_interactions_this_run`
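For example, a hedged sketch of a session-level custom incremental model for the web package, enriched with a hypothetical `dim_users` table from your own project (all non-Snowplow names and columns are placeholders, and the model is assumed to live in the tagged directory shown earlier):

```sql
{# /models/snowplow_<package>_custom_models/my_sessions_enriched.sql (hypothetical model) #}
{{
  config(
    materialized='incremental',
    unique_key='domain_sessionid',
    upsert_date_key='start_tstamp',
    snowplow_optimize=true
  )
}}

select
  s.*,
  u.plan_tier -- hypothetical attribute from your own user dimension
from {{ ref('snowplow_web_sessions_this_run') }} s
left join {{ ref('dim_users') }} u -- hypothetical table in your project
  on s.domain_userid = u.domain_userid
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }}
```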
This run models
If you need to produce a custom `_this_run` type model, you should build this model off the relevant package's `snowplow_<package>_base_events_this_run` table. This table has all the columns from your events table in it, including any self-describing events and custom contexts (except in Redshift). Note that this table has some custom logic in each package to identify which events to include in each run, including your `app_ids` filter. For more information about the way this works see the incremental logic page, but the key takeaway is that all events with a session ID seen in events since the last processing are included in the next run - the session ID field varies by package, and the normalize package does not have this filter.
To build a custom `_this_run` table you need to ensure that the model is materialized as a `table` and that you use `{{ ref('snowplow_<package>_base_events_this_run') }}`. This type of model does NOT need to be tagged. You can then use this model in other incremental-type custom models.
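As a hedged sketch, assuming the web package - the filter and model name are placeholders for your own logic:

```sql
{# /models/snowplow_<package>_custom_models/my_custom_events_this_run.sql (hypothetical model) #}
{{
  config(
    materialized='table'
  )
}}

-- Keep only the event types your downstream incremental models need.
select
  ev.*
from {{ ref('snowplow_web_base_events_this_run') }} ev
where ev.event_name = 'page_ping' -- placeholder filter; replace with your own logic
```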
Retiring custom models
This process is the same for all types of custom model.
If you want to retire a custom model, you should delete the models from your project or disable the models.
There is no need to remove the models from the `snowplow_<package>_incremental_manifest` table. The package identifies enabled models tagged with `snowplow_<package>_incremental` within your project and selects these models from the manifest in order to calculate the state of the package's models as described above.
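If you prefer to disable a model rather than delete it, a minimal sketch using standard dbt config (nothing here is specific to the Snowplow packages):

```sql
{# /models/snowplow_<package>_custom_models/my_custom_model.sql #}
{{
  config(
    enabled=false
  )
}}
```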
Do NOT just use `--exclude` on the retired models in your production job. Currently the packages are unable to identify which models are due to be executed in a given run. As a result, if you exclude a model the package will get stuck in State 3 and continue to attempt to sync your excluded models with the remaining models.
Back-filling custom models
Full refresh/view model & This run models
As these models are rebuilt fully each time the package is run, there is no special action to take to backfill these types of models.
Incremental Models
Over time you may wish to add more custom incremental models to extend the functionality of this package. As you introduce new custom models into your project, assuming they are tagged correctly, the package will automatically replay all events up until the latest event to have been processed by the other models.
During back-filling, the derived tables are blocked from updating. This is to protect against a batched back-fill temporarily introducing incomplete data into these derived tables. The batch size of this back-fill is limited as outlined in the identification of events to process section, meaning it might take several runs to complete the back-fill, during which time no new events will be processed by the main models.
Back-filling a model can be performed either as part of the entire run of the Snowplow package, or in isolation to reduce cost (recommended):
```bash
dbt run --select snowplow_<package> tag:snowplow_<package>_incremental # Will execute all Snowplow <package> models, as well as custom.
dbt run --select +my_custom_model # Will execute only your custom model + any upstream nodes.
```
Tearing down and restarting a subset of models
Full refresh/view model & This run models
As these models are rebuilt fully each time the package is run, there is no special action to take to restart these types of models.
Incremental models
As the code base for your custom incremental models evolves, you will likely need to replay events through a given model. In order to do this you must:
- Manually drop the table(s) created by your custom model(s) in your database (you may wish to simply rename them until the back-fill is completed, in case of any issues).
- Remove the models from the manifest table (see the Complete refresh section for an explanation as to why). This can be achieved either by:
  - (Recommended) using the `models_to_remove` variable at run time:

    ```bash
    dbt run --select +snowplow_<package>_custom_incremental_model --vars '{snowplow__start_date: "yyyy-mm-dd", models_to_remove: snowplow_<package>_custom_incremental_model}'
    ```

  - (High risk) manually deleting the record from the `snowplow_<package>_incremental_manifest` table.

By removing the `snowplow_<package>_custom_incremental_model` model from the manifest, the package will be in State 2 and will replay all events.
If you want to replay events through a series of dependent models, you only need to pass the name of the endmost model within the run.
Tips for developing custom models
Most of these tips apply only to Incremental type custom models
While developing custom models you may benefit from the following:
- Minimizing the amount of data being processed to reduce cost & run time.
- Use recent events from your events table to ensure you have all the latest contexts and event types available.
- BigQuery only: Automatic handling of evolving schemas for custom contexts and unstructured events.
1. Reducing Costs
By setting `snowplow__backfill_limit_days` to 1 in your `dbt_project.yml` file you will only process a day's worth of data per run.
We have provided the `get_value_by_target` macro to dynamically switch the backfill limit depending on your environment, i.e. dev vs. prod, with your environment determined by your target name:
```yml
# dbt_project.yml
...
vars:
  snowplow_<package>:
    snowplow__backfill_limit_days: "{{ snowplow_utils.get_value_by_target(
                                        dev_value=1,
                                        default_value=30,
                                        dev_target_name='dev') }}"
```
2. Using Recent Data
This can be achieved by setting `snowplow__start_date` to a recent date. To dynamically change the start date depending on your environment, you can use the following:
```yml
# dbt_project.yml
...
vars:
  snowplow_<package>:
    snowplow__start_date: "{{ snowplow_utils.get_value_by_target(
                               dev_value=snowplow_utils.n_timedeltas_ago(1, 'weeks'),
                               default_value='2020-01-01',
                               dev_target_name='dev') }}"
```
3. Handling of schema evolution
As your schemas for custom contexts and unstructured events evolve, multiple versions of the same column will be created in your events table, e.g. `custom_context_1_0_0`, `custom_context_1_0_1`. These columns contain nested fields, i.e. they are of datatype `RECORD`. When modeling Snowplow data it can be useful to combine or coalesce each nested field across all versions of the column for a continuous view over time.
The snowplow-utils package provides the `combine_column_versions` macro, which will automatically coalesce the fields within every version of the specified column. This mitigates the need for you to update your models every time a new column version is created.
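A minimal, illustrative sketch of how this might be wired up, assuming the macro returns a list of coalesced field expressions - the relation, column prefix and output handling below are placeholders, and the exact arguments and return value may differ by version, so check the snowplow-utils docs before relying on this:

```sql
{# Hypothetical BigQuery model coalescing fields across versions of a custom context column #}
{%- set context_fields = snowplow_utils.combine_column_versions(
                            relation=ref('snowplow_web_base_events_this_run'),
                            column_prefix='custom_context_1'
                         ) -%}

select
  event_id,
  {{ context_fields | join(',\n  ') }}
from {{ ref('snowplow_web_base_events_this_run') }}
```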
Please refer to the snowplow-utils docs for the full documentation on these macros.