Query your data
If you were using the `default` or `secure` example scripts unedited in the last section, you will have created a Postgres or BigQuery database where all of your data is stored. Your database will contain the following standard Snowplow schemas and tables:
Postgres:
- `atomic`: this is your rich, high quality data
- `atomic_bad`: this is the data that has failed pipeline validation

BigQuery:
- `events`: this is your rich, high quality data
Step 1. Querying your good data
To query the good data, you will first need to connect to your database.
Postgres:
- Connect to the database using the username and password you provided when creating the pipeline, along with the `db_address` and `db_port` you noted down after the pipeline was created.
- Run a query against your `atomic.events` table to take a look at the page view event that you generated in the previous step (where `event_name = 'page_view'`). You can understand more about each field in the canonical event here.

```sql
SELECT * FROM atomic.events WHERE event_name = 'page_view';
```
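If you would rather not scan every column, a narrower query over a few of the standard canonical fields gives a quicker read on each page view. This is a minimal sketch; the columns shown are part of the canonical event model:

```sql
-- A smaller slice of the same page view data: a few canonical columns only.
SELECT
    event_id,
    derived_tstamp,
    page_url,
    page_title,
    geo_country
FROM atomic.events
WHERE event_name = 'page_view'
ORDER BY derived_tstamp DESC
LIMIT 10;
```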
BigQuery (within GCP):
- Navigate to the BigQuery UI: https://console.cloud.google.com/bigquery
- In the Explorer, open up the available datasets and select your newly created one, `<prefix>_snowplow_db`
- Select the `events` table (this should show all the fields available for querying)
- Select `Query` in the top panel and open in a new tab
- Execute the following (this should almost exactly resemble the default query GCP produces):

```sql
SELECT * FROM `<project_id>.<prefix>_snowplow_db.events` WHERE DATE(collector_tstamp) = "YYYY-MM-DD" LIMIT 1000
```

Run a query against your events table to take a look at the page view event that you generated in the previous step (where `event_name = 'page_view'`). You can understand more about each field in the canonical event here.

```sql
SELECT * FROM `<project_id>.<prefix>_snowplow_db.events` WHERE DATE(collector_tstamp) = "YYYY-MM-DD" AND event_name = 'page_view' LIMIT 1000
```
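Once that returns rows, a small aggregate is a handy sanity check. This sketch reuses the same `<project_id>` and `<prefix>` placeholders and counts the events that have landed per event type:

```sql
-- Count events per type for a given day to see what has landed so far.
SELECT
    event_name,
    COUNT(*) AS event_count
FROM `<project_id>.<prefix>_snowplow_db.events`
WHERE DATE(collector_tstamp) = "YYYY-MM-DD"
GROUP BY event_name
ORDER BY event_count DESC
```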
By default, there are 5 enrichments enabled, as listed below. These enrichments add extra properties and values to your events in real time as they are being processed by the Enrich application.

- UA parser
- YAUAA
- Campaign attribution
- Referer parser
- Event fingerprint
Some enrichments are legacy and therefore populate fields directly in your events table. From the above list, these are the campaign attribution, referer parser and event fingerprint enrichments. The UA parser and YAUAA enrichments each add a separate entity to the event (these are also referred to as contexts, since they add additional context to the events in your events table).
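To see what the legacy enrichments add, you can inspect the canonical columns they populate on the events table. This sketch uses the Postgres table name, but the same columns exist in the BigQuery events table:

```sql
-- Columns on atomic.events populated by the legacy enrichments.
SELECT
    mkt_medium, mkt_source, mkt_campaign,   -- campaign attribution enrichment
    refr_medium, refr_source, refr_term,    -- referer parser enrichment
    event_fingerprint                       -- event fingerprint enrichment
FROM atomic.events
WHERE event_name = 'page_view'
LIMIT 10;
```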
Postgres:

The contexts are loaded into separate tables:
- `atomic.com_snowplowanalytics_snowplow_ua_parser_context_1`
- `atomic.nl_basjes_yauaa_context_1`

Note: you can join these contexts back to your `atomic.events` table using `root_id = event_id`.
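As a concrete illustration of that join, the following sketch attaches the YAUAA context to each page view. The `root_id = event_id` relationship comes from the note above, and selecting `yauaa.*` avoids assuming specific context column names:

```sql
-- Join the YAUAA context entity back to its parent event (Postgres).
SELECT
    e.event_id,
    e.derived_tstamp,
    e.page_urlpath,
    yauaa.*                  -- all flattened YAUAA fields for this event
FROM atomic.events AS e
JOIN atomic.nl_basjes_yauaa_context_1 AS yauaa
    ON yauaa.root_id = e.event_id
WHERE e.event_name = 'page_view';
```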
BigQuery:

The contexts are loaded into separate columns of the events table:
- `contexts_com_snowplowanalytics_snowplow_ua_parser_context_1_0_0`
- `contexts_nl_basjes_yauaa_context_1_0_3`
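In BigQuery these context columns are repeated records, so they are typically read with `UNNEST`. The field names below (`device_class`, `agent_name`) are assumptions based on the YAUAA context schema; check your table's schema for the exact names:

```sql
-- Unnest the YAUAA context column to query its fields alongside the event.
SELECT
    event_id,
    yauaa.device_class,      -- assumed field name from the YAUAA context schema
    yauaa.agent_name         -- assumed field name from the YAUAA context schema
FROM `<project_id>.<prefix>_snowplow_db.events`,
    UNNEST(contexts_nl_basjes_yauaa_context_1_0_3) AS yauaa
WHERE DATE(collector_tstamp) = "YYYY-MM-DD"
LIMIT 100
```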
Step 2. Querying your bad data
In the last section, we sent a test event that would fail to be processed by your pipeline (specifically, one that fails to validate against a schema). This is a fundamental aspect of Snowplow: ensuring that only good quality data reaches your stream, lake and warehouse, and siphoning off poor quality data so that you have the ability to correct and recover it.
As the custom `product_view` event passed through your pipeline, the Enrich application fetched the schema for the event. It does this so it can validate that the structure of the event conforms to what was defined up front, thereby ensuring it is of the expected quality. Since the schema for the `product_view` event doesn't yet exist in your Iglu schema registry, the event failed to validate.
Postgres:

Your `atomic_bad` schema holds events that have failed to be processed by your pipeline. These are called failed events.

You will see in Postgres that you have a table called `atomic_bad.com_snowplowanalytics_snowplow_badrows_schema_violation_1`.

You might also see adapter failure failed events in Postgres. Many adapter failures are caused by bot traffic, so do not be surprised to see some of them in your pipeline. Find out more here.
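For Postgres, a quick way to inspect these schema violations is to pull a handful of rows from that table; `SELECT *` keeps this sketch from assuming any particular column layout:

```sql
-- Take a first look at schema-violation failed events (Postgres).
SELECT *
FROM atomic_bad.com_snowplowanalytics_snowplow_badrows_schema_violation_1
LIMIT 10;
```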
BigQuery:

Currently the Quick Start guide pipelines do not load bad data into BigQuery directly. To query it, you will need to follow a few additional steps manually:

- Configure the GCS Loader to load bad data from the Bad 1 (`<prefix>-bad-1-topic`) and Bad Rows (`<prefix>-bq-bad-rows-topic`) Pub/Sub topics into GCS.
- Create external tables in BigQuery to read the data from GCS as described here.
In the next section, we guide you through creating a custom schema so that your custom event validates against it rather than becoming a failed event.
Learn more about self-describing events and schemas, and the different types of failures here.
Next, let's take a detailed look at what you have deployed.