GCP API errors causing issues for Qubole cluster operations
Incident Report for Qubole
Resolved
Qubole DevOps has restored the environment to its working state from before the GCP API outage. A series of thorough application functionality verifications was performed and passed. We appreciate your patience through this process.
Posted Oct 08, 2020 - 17:30 PDT
Update
As noted in our earlier update, Qubole DevOps has applied additional workarounds and verified that cluster operations remain functional. We are continuing to work with the Google support team to resolve this issue permanently. Google's latest update is as follows:
"Google's API Discovery Service GetRest (https://www.googleapis.com/discovery/v1/apis/pubsub/v1/rest) requests are hanging in the following regions: asia-northeast1, asia-northeast2, asia-northeast3, asia-southeast1, europe-west1, europe-west3, europe-west6, europe-west4, northamerica-northeast1, southamerica-east1, us-central1, us-east1, us-west1, us-west2, and us-west4.

We are currently working to mitigate by rolling back a configuration change. Next update time is Thursday, 2020-10-08 07:00 US/Pacific."

The same update is available at https://status.cloud.google.com/.
Posted Oct 08, 2020 - 01:53 PDT
Monitoring
Qubole DevOps has applied additional workarounds and verified that cluster operations are functional again. We will continue to track the incident on the GCP side and determine the best course of action to re-configure the workarounds thereafter. We appreciate your patience through this process. If you see further issues, please don't hesitate to reach out to Qubole Support.
Posted Oct 07, 2020 - 18:00 PDT
Identified
Qubole DevOps has determined that the temporary workaround is not working consistently and is pursuing additional potential solutions. We are working closely with GCP Support and following their statuspage updates. Their latest update is as follows:

"Google's API Discovery Service GetRest (https://www.googleapis.com/discovery/v1/apis/pubsub/v1/rest) requests are hanging in the following regions: asia-northeast1, asia-northeast2, asia-northeast3, europe-west3, europe-west6, northamerica-northeast1, southamerica-east1, us-west2, and us-west4.

We are currently working to mitigate by rolling back a configuration change. We expect the rollout to complete within the next 7 hours. Next update time is Wednesday, 2020-10-07 23:15 US/Pacific."
Posted Oct 07, 2020 - 17:09 PDT
Monitoring
Qubole DevOps has been working with GCP Support and has implemented a short-term workaround that has improved the ability to start and stop clusters. Note that during our investigation, GCP Support posted a public statuspage update here: https://status.cloud.google.com. We will monitor the situation and determine whether additional steps are required to stabilize the performance of the service.
Posted Oct 07, 2020 - 16:34 PDT
Update
Qubole DevOps continues to debug the current issue with assistance from GCP Support. We will post another update as soon as we have more information.
Posted Oct 07, 2020 - 14:23 PDT
Investigating
Qubole DevOps was alerted to issues with cluster operations in the GCP environment. While we do not see a public statuspage incident from GCP at this time, we have detected a large number of timeouts on standard GCP API calls. We are opening a critical investigation with GCP Support and will post updates as we learn more. At this time, cluster start and stop operations are failing.
Posted Oct 07, 2020 - 13:23 PDT
This incident affected: gcp.qubole.com Environment (GCP) (Cluster Operations).