You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could...

Question

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

Accepted Answer

A is incorrect because 1) multi-region buckets do not provide the lowest latency or sufficient bandwidth, and 2) multi-region buckets have an RPO of 1 hour (ie. data from the last hour could be lost).
B is correct because dual-region buckets with turbo-replication have an RPO of 15 mins as required. Dataproc cluster gets redeployed to the available region and is colocated with the bucket.
C is incorrect because STS only allows hourly transfers, which would not meet the 15 min RPO.
D is incorrect because the primary Dataproc cluster runs in a different region to the bucket, increasing latency.

Ready to practice?