Troubleshooting

Monitoring

To verify that the process is running properly, you can monitor several locations, depending on your runtime environment:

data sinks, which hold the output of the recovery job: failedOutput (S3/GCS bucket), unrecoverableOutput (S3/GCS bucket) and output (Kinesis/PubSub streams)
job runners, which track job status and processing in real time

Possible Failures

There are two main reasons why the recovery process may fail: Process Failures, which prevent the recovery process from proceeding, and Recovery Failures, which mean that a recovery flow cannot be applied to a given bad row or line. A third set of issues is unexpected behavior of the application. The three failure types need to be handled differently.

Process Failures

Process failures are usually caused by misconfiguration and typically occur as the job is submitted to its runner (Apache Beam, Spark, Flink). The reported failures may differ, but they usually boil down to one of the following:

Missing Resources

This means that the resources set as input or output for recovery do not exist or are inaccessible to the application.

Action: Check that the resources provided in the job configuration exist, and review their access policies.

Configuration Issues

Configuration issues are the largest group of issues causing early process termination. They are reported in the CLI as the job is submitted.

Action: Make sure the configuration is well formed and conforms to its schema.

Recovery Failures

Recovery failures are the largest and most common class of issues with the automated recovery process. They are runtime errors that prevent the application from recovering given lines or bad rows. These failures produce values under the failedOutput directory.

A recovery error is itself another bad row type that can be recovered. For convenience, recovery errors use the original configuration and do not require additional recovery flows configured specifically for them.

A recovery error contains additional information pointing to:

recovery flow name
number of recoveries attempted
original bad row

Missing recovery configuration

This happens when trying to recover a line for which no recovery flow configuration can be matched.

Action: Check whether the supplied configuration contains a proper mapping for the problematic bad row instance, including its schema key and the conditions that are applied before a configuration is chosen.
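The mapping check above can be approximated offline before resubmitting a job. The following is a minimal sketch, not the application's own matching logic. It assumes that each input line is a JSON object carrying its schema key under a "schema" field, and that the configuration can be reduced to a dictionary keyed by schema key, possibly with `*` as a wildcard. The function name, the dictionary shape and the sample keys are all hypothetical, and conditions are not evaluated.

```python
import json
from collections import Counter
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List


def unmatched_schema_keys(
    bad_row_lines: Iterable[str],
    flows_by_schema: Dict[str, List[dict]],
) -> Counter:
    """Count bad rows whose schema key matches no entry in the flow mapping.

    Assumption: lines are valid JSON objects; conditions are ignored here.
    """
    unmatched: Counter = Counter()
    for line in bad_row_lines:
        schema_key = json.loads(line).get("schema", "<no schema key>")
        # Assumption: mapping keys may use '*' as a wildcard (e.g. for versions).
        if not any(fnmatchcase(schema_key, pattern) for pattern in flows_by_schema):
            unmatched[schema_key] += 1
    return unmatched


# Hypothetical configuration and input, for illustration only.
config = {"iglu:com.acme/bad_a/jsonschema/1-0-*": [{"name": "flow-a"}]}
rows = [
    '{"schema": "iglu:com.acme/bad_a/jsonschema/1-0-2", "data": {}}',
    '{"schema": "iglu:com.acme/bad_b/jsonschema/1-0-0", "data": {}}',
]
for key, count in unmatched_schema_keys(rows, config).items():
    print(f"{count} row(s) without a recovery flow: {key}")
```

A schema key that appears in this report has no candidate flow at all. A key that is absent from the report may still fail to match if its conditions exclude the row.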

Replacement Failures

The replacement specified with a Replace Step has failed. The reason is supplied along with the exported RecoveryError bad row under unrecoveredOutput.

Action: Consult the exported message. A failure here likely means that the configuration supplied to modify the source bad row needs adjusting, either by adding another recovery flow (when the faulty row is a special case) or by directly adjusting the recovery flow supplied with the RecoveryError.

Cast Failures

Much like Replacement Failures, Cast Failures occur when modifying a value with a Cast Step has failed. The reason is supplied along with the exported RecoveryError bad row under unrecoveredOutput.

Unrecoverable Bad Row Type

Some bad row types are marked as unrecoverable and are therefore not transformed, even when a matching configuration exists. They are written to the unrecoverable output.

Action: If you think that a specific bad row type should be recoverable, consult Extending Recovery and consider submitting a Pull Request.

Invalid JSON Format

This happens whenever the input data is not valid JSON.

Action: Check that the input JSON file contains a proper JSON object on each line.
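One way to carry out this check is to scan the input line by line before submitting the job. The sketch below uses only the Python standard library. The function name is hypothetical, and treating empty lines as errors is an assumption made here, not documented behavior of the recovery job.

```python
import json
from typing import Iterable, List, Tuple


def find_invalid_json_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for every line that is not a JSON object."""
    problems: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Assumption: an empty line counts as a problem.
            problems.append((number, "empty line"))
            continue
        try:
            value = json.loads(text)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(value, dict):
            # Arrays, strings and numbers parse fine but are not objects.
            problems.append((number, "valid JSON but not an object"))
    return problems


sample = ['{"schema": "x", "data": {}}', '{"broken": ', "[1, 2]"]
for number, reason in find_invalid_json_lines(sample):
    print(f"line {number}: {reason}")
```

For a real input, the same function can be given an open file handle, since iterating over a text file yields one line at a time.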

Invalid Data Format

This happens whenever the input data is not in a valid BadRow JSON format.

Action: Check that the input JSON file contains a proper BadRow JSON object on each line.
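A rough shape check can catch the most obvious cases before the job runs. The sketch below assumes that a BadRow is a self-describing JSON object with a string "schema" key and an object "data" payload. That shape is an assumption made for illustration, the function name is hypothetical, and the check does not validate the payload against the actual BadRow schema.

```python
import json
from typing import Optional


def bad_row_problem(line: str) -> Optional[str]:
    """Describe why a line does not look like a bad row, or return None."""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(value, dict):
        return "not a JSON object"
    # Assumption: a bad row carries a string "schema" key and an object "data".
    if not isinstance(value.get("schema"), str):
        return 'missing or non-string "schema" key'
    if not isinstance(value.get("data"), dict):
        return 'missing or non-object "data" payload'
    return None


# Hypothetical sample lines, for illustration only.
for sample in ['{"schema": "iglu:com.acme/bad_row/jsonschema/1-0-0", "data": {}}',
               '{"data": {}}']:
    print(bad_row_problem(sample) or "looks like a bad row")
```

A line that passes this check can still be rejected by the job, because only the outer envelope is inspected here.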

Codec Failures

This set of failures should not normally occur. It means that there was an error serializing data with Thrift (ThriftFailure) or with Base64 coding. Some of these issues might be related to the runtime locale.

Action: Consult the error message written along with the BadRow to unrecoveredOutput. Verify that the bad row producer source uses the same locale as the recovery worker (UTF-8).
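The effect of a charset mismatch on Base64-encoded data can be reproduced in isolation. The sketch below is a generic illustration using the Python standard library, not the recovery job's own codec. The function name and the sample text are hypothetical.

```python
import base64
from typing import Optional


def decode_payload(encoded: str) -> Optional[str]:
    """Decode Base64 text as UTF-8, returning None when either step fails."""
    try:
        raw = base64.b64decode(encoded, validate=True)
        return raw.decode("utf-8")
    except ValueError:
        # Both binascii.Error and UnicodeDecodeError are ValueError subclasses.
        return None


text = "café"
# Producer and worker agree on UTF-8: the round trip succeeds.
same_charset = base64.b64encode(text.encode("utf-8")).decode("ascii")
# Producer used Latin-1 instead: the bytes are not valid UTF-8.
other_charset = base64.b64encode(text.encode("latin-1")).decode("ascii")

print(decode_payload(same_charset))   # café
print(decode_payload(other_charset))  # None
```

The Base64 step itself succeeds in both cases. It is the final interpretation of the bytes as UTF-8 that fails, which is why such errors tend to surface only for rows containing non-ASCII characters.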

Unexpected Behavior

This is hopefully the rarest, but also the least specific, set of issues that may occur. If you encounter any unexpected behavior, consider submitting a GitHub issue.
