Databricks loader
Setting up Databricks
The following resources need to be created:
Also, the events table needs to be created before starting the application for the first time:
Statement for creating the events table
CREATE TABLE IF NOT EXISTS snowplow.events (
-- App
app_id VARCHAR(255),
platform VARCHAR(255),
-- Date/time
etl_tstamp TIMESTAMP,
collector_tstamp TIMESTAMP NOT NULL,
dvce_created_tstamp TIMESTAMP,
-- Event
event VARCHAR(128),
event_id VARCHAR(36) NOT NULL,
txn_id INTEGER,
-- Namespacing and versioning
name_tracker VARCHAR(128),
v_tracker VARCHAR(100),
v_collector VARCHAR(100) NOT NULL,
v_etl VARCHAR(100) NOT NULL,
-- User and visit
user_id VARCHAR(255),
user_ipaddress VARCHAR(128),
user_fingerprint VARCHAR(128),
domain_userid VARCHAR(128),
domain_sessionidx SMALLINT,
network_userid VARCHAR(128),
-- Location
geo_country VARCHAR(2),
geo_region VARCHAR(3),
geo_city VARCHAR(75),
geo_zipcode VARCHAR(15),
geo_latitude DOUBLE,
geo_longitude DOUBLE,
geo_region_name VARCHAR(100),
-- IP lookups
ip_isp VARCHAR(100),
ip_organization VARCHAR(128),
ip_domain VARCHAR(128),
ip_netspeed VARCHAR(100),
-- Page
page_url VARCHAR(4096),
page_title VARCHAR(2000),
page_referrer VARCHAR(4096),
-- Page URL components
page_urlscheme VARCHAR(16),
page_urlhost VARCHAR(255),
page_urlport INTEGER,
page_urlpath VARCHAR(3000),
page_urlquery VARCHAR(6000),
page_urlfragment VARCHAR(3000),
-- Referrer URL components
refr_urlscheme VARCHAR(16),
refr_urlhost VARCHAR(255),
refr_urlport INTEGER,
refr_urlpath VARCHAR(6000),
refr_urlquery VARCHAR(6000),
refr_urlfragment VARCHAR(3000),
-- Referrer details
refr_medium VARCHAR(25),
refr_source VARCHAR(50),
refr_term VARCHAR(255),
-- Marketing
mkt_medium VARCHAR(255),
mkt_source VARCHAR(255),
mkt_term VARCHAR(255),
mkt_content VARCHAR(500),
mkt_campaign VARCHAR(255),
-- Custom structured event
se_category VARCHAR(1000),
se_action VARCHAR(1000),
se_label VARCHAR(4096),
se_property VARCHAR(1000),
se_value DOUBLE,
-- Ecommerce
tr_orderid VARCHAR(255),
tr_affiliation VARCHAR(255),
tr_total DECIMAL(18,2),
tr_tax DECIMAL(18,2),
tr_shipping DECIMAL(18,2),
tr_city VARCHAR(255),
tr_state VARCHAR(255),
tr_country VARCHAR(255),
ti_orderid VARCHAR(255),
ti_sku VARCHAR(255),
ti_name VARCHAR(255),
ti_category VARCHAR(255),
ti_price DECIMAL(18,2),
ti_quantity INTEGER,
-- Page ping
pp_xoffset_min INTEGER,
pp_xoffset_max INTEGER,
pp_yoffset_min INTEGER,
pp_yoffset_max INTEGER,
-- User Agent
useragent VARCHAR(1000),
-- Browser
br_name VARCHAR(50),
br_family VARCHAR(50),
br_version VARCHAR(50),
br_type VARCHAR(50),
br_renderengine VARCHAR(50),
br_lang VARCHAR(255),
br_features_pdf BOOLEAN,
br_features_flash BOOLEAN,
br_features_java BOOLEAN,
br_features_director BOOLEAN,
br_features_quicktime BOOLEAN,
br_features_realplayer BOOLEAN,
br_features_windowsmedia BOOLEAN,
br_features_gears BOOLEAN,
br_features_silverlight BOOLEAN,
br_cookies BOOLEAN,
br_colordepth VARCHAR(12),
br_viewwidth INTEGER,
br_viewheight INTEGER,
-- Operating System
os_name VARCHAR(50),
os_family VARCHAR(50),
os_manufacturer VARCHAR(50),
os_timezone VARCHAR(255),
-- Device/Hardware
dvce_type VARCHAR(50),
dvce_ismobile BOOLEAN,
dvce_screenwidth INTEGER,
dvce_screenheight INTEGER,
-- Document
doc_charset VARCHAR(128),
doc_width INTEGER,
doc_height INTEGER,
-- Currency
tr_currency VARCHAR(3),
tr_total_base DECIMAL(18, 2),
tr_tax_base DECIMAL(18, 2),
tr_shipping_base DECIMAL(18, 2),
ti_currency VARCHAR(3),
ti_price_base DECIMAL(18, 2),
base_currency VARCHAR(3),
-- Geolocation
geo_timezone VARCHAR(64),
-- Click ID
mkt_clickid VARCHAR(128),
mkt_network VARCHAR(64),
-- ETL tags
etl_tags VARCHAR(500),
-- Time event was sent
dvce_sent_tstamp TIMESTAMP,
-- Referer
refr_domain_userid VARCHAR(128),
refr_dvce_tstamp TIMESTAMP,
-- Session ID
domain_sessionid VARCHAR(128),
-- Derived timestamp
derived_tstamp TIMESTAMP,
-- Event schema
event_vendor VARCHAR(1000),
event_name VARCHAR(1000),
event_format VARCHAR(128),
event_version VARCHAR(128),
-- Event fingerprint
event_fingerprint VARCHAR(128),
-- True timestamp
true_tstamp TIMESTAMP,
-- Collector timestamp date for partitioning
collector_tstamp_date DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);
Minimal Security Configuration (optional)
The security principal used by the loader needs the Databricks SQL access permission, which can be enabled in the Admin Console.
Databricks does not have table access control enabled by default. Enable it with the following init script:
Init script creation in a Scala notebook
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-table-access.conf
|[driver] {
| "spark.databricks.acl.sqlOnly" = "true"
|}
|EOF
""".stripMargin, true)
After adding the script, you need to restart the cluster. Verify that the change took effect by evaluating spark.conf.get("spark.databricks.acl.sqlOnly"), which should return true.
Start the loader so that it can create its tables. The cluster is now ready to have permissions configured:
-- Clean initial permissions from tables
REVOKE ALL PRIVILEGES ON TABLE <catalog>.<schema>.events FROM `<principal>`;
REVOKE ALL PRIVILEGES ON TABLE <catalog>.<schema>.manifest FROM `<principal>`;
REVOKE ALL PRIVILEGES ON TABLE <catalog>.<schema>.rdb_folder_monitoring FROM `<principal>`;
-- Clean initial permissions from schema
REVOKE ALL PRIVILEGES ON SCHEMA <catalog>.<schema> FROM `<principal>`;
-- Loader will run CREATE TABLE IF NOT EXISTS statements, so USAGE and CREATE are both required.
GRANT USAGE, CREATE ON SCHEMA <catalog>.<schema> TO `<principal>`;
-- The COPY INTO statement requires SELECT ON ANY FILE and MODIFY on the receiving table
GRANT SELECT ON ANY FILE TO `<principal>`;
GRANT MODIFY ON TABLE <catalog>.<schema>.events TO `<principal>`;
-- These tables are used to store internal loader state
GRANT MODIFY, SELECT ON TABLE <catalog>.<schema>.manifest TO `<principal>`;
GRANT MODIFY, SELECT ON TABLE <catalog>.<schema>.rdb_folder_monitoring TO `<principal>`;
Downloading the artifact
The asset is published as a jar file attached to the GitHub release notes for each version. It's also available as a Docker image on Docker Hub under snowplow/rdb-loader-databricks:5.4.1.
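For example, you can fetch the Docker image for this version with:
$ docker pull snowplow/rdb-loader-databricks:5.4.1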
Configuring rdb-loader-databricks
The loader takes two configuration files:
- a config.hocon file with application settings
- an iglu_resolver.json file with the resolver configuration for your Iglu schema registry
An example of the minimal required config for the Databricks loader can be found here and a more detailed one here. For details about each setting, see the configuration reference.
See here for details on how to prepare the Iglu resolver file.
All self-describing schemas for events processed by RDB Loader must be hosted on Iglu Server 0.6.0 or above. Iglu Central is a registry containing Snowplow-authored schemas. If you want to use them alongside your own, you will need to add it to your resolver file. Keep in mind that it could override your own private schemas if you give it higher priority. For details on this see here.
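As an illustration, a resolver file that includes both Iglu Central and a private Iglu Server might look like the following sketch. The uri, apikey and vendor prefix values are placeholders, and you should set the priority values per the resolver documentation so that your private registry takes precedence:
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 1,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": { "http": { "uri": "http://iglucentral.com" } }
      },
      {
        "name": "My Iglu Server",
        "priority": 0,
        "vendorPrefixes": [ "com.example" ],
        "connection": { "http": { "uri": "https://iglu.example.com/api", "apikey": "your-api-key" } }
      }
    ]
  }
}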
Running the Databricks loader
The two config files need to be passed in as base64-encoded strings:
$ docker run snowplow/rdb-loader-databricks:5.4.1 \
--iglu-config $RESOLVER_BASE64 \
--config $CONFIG_BASE64
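For example, assuming the two files are saved as config.hocon and iglu_resolver.json, you could prepare the environment variables like this (the -w 0 flag is GNU coreutils; on macOS, pipe the output through tr -d '\n' instead):
$ CONFIG_BASE64=$(base64 -w 0 config.hocon)
$ RESOLVER_BASE64=$(base64 -w 0 iglu_resolver.json)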
Telemetry notice
By default, Snowplow collects telemetry data for Databricks Loader (since version 5.0.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).
This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.
If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.
If you wish to disable telemetry, you can do so by setting telemetry.disable to true.
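For illustration, these settings sit under a telemetry block in config.hocon and might look like the following sketch (the values shown are placeholders):
"telemetry": {
  "userProvidedId": "myname@acme.com"
  "disable": false
}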
See our telemetry principles for more information.