Getting Started
Note: This documentation is for pipeline versions R118+. If you are unsure which version your pipeline is running, please contact support.
At its core, event recovery is the ability to fix events that have failed and replay them through your pipeline.
After inspecting failed events, either in the Snowplow BDP Console or in the partitioned failure buckets, you can determine which events can be recovered based on what the fix entails.
With recovery it is possible to:
- replace values - e.g. correct a typo in a schema name for validation
- remove values - e.g. remove improperly encoded values from a URL string
- cast JSON types - e.g. change a property's type from `string` to `integer`
If applying the fixes above would not repair your failed events, they are currently considered unrecoverable. Because your storage may contain a mix of recoverable and unrecoverable data, event recovery uses a configuration to process only a subset of the failed events.
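To illustrate, a recovery configuration pairs a failure type with one or more fix steps. The sketch below shows the general shape of such a configuration; the schema URIs, field names, and step operations are drawn from memory of the recovery configuration format and should be checked against the configuration reference before use:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/4-0-0",
  "data": {
    "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0": [
      {
        "name": "fix schema name typo",
        "conditions": [],
        "steps": [
          {
            "op": "Replace",
            "path": "$.payload.enriched.contexts",
            "match": "my-schma",
            "value": "my-schema"
          }
        ]
      }
    ]
  }
}
```

Only failed events matching the listed failure type are processed; everything else is left untouched.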
What you'll need to get started
The typical flow for a recovery, and some prerequisites to consider at each step, is as follows:
Understanding the failure issue
- Familiarity with the failed event types
- Access to S3 or GCS buckets with failed events
Configuring a recovery
- Understanding the configuration structure
- Comfort with using Regular Expressions (RegEx)
Testing the configuration
- Ability to edit/run a Scala script locally
Running the recovery
- AWS sub-account or GCP project admin access to create a recovery user
Monitoring the recovery
- Access to the Dataflow UI (GCP) or EMR reporting (AWS)
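Since testing a configuration involves editing and running a Scala script locally, a quick way to sanity-check a regular expression before embedding it in a replacement step is a few lines of plain Scala. This is a minimal sketch, not the recovery library itself; the schema URI and the typo below are hypothetical examples:

```scala
// Sanity-check a regex replacement locally before using it in a recovery step.
// The failed schema URI and the "my-schma" typo are hypothetical examples.
object RecoveryRegexCheck {
  // The same kind of substitution a "replace values" recovery step would apply
  def fix(input: String): String =
    input.replaceAll("my-schma", "my-schema")

  def main(args: Array[String]): Unit = {
    val failed = "iglu:com.acme/my-schma/jsonschema/1-0-0"
    println(fix(failed)) // prints the corrected schema URI
  }
}
```

Running the script and eyeballing the output against a handful of real failed payloads is a cheap way to catch an over-greedy pattern before it runs across an entire failure bucket.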