Issue with increased error rates for AWS API

Incident Report for Qubole

Resolved

Qubole DevOps has verified the resolution from AWS, verified our internal cluster operations, and cleared all related operational alerts. AWS posted its resolution at 10:52 AM PDT. We are resolving this incident at this time.

Update from AWS: Between 5:18 AM and 10:25 AM PDT we experienced increased error rates for some EC2 APIs and new instance launches in a Single Availability Zone in the US-EAST-1 region. Existing instances were unaffected. We are working to address API errors affecting a small number of EBS volumes as a result of this issue. The issue has been resolved and the service is operating normally.

Posted Jul 29, 2020 - 11:24 PDT

Update

Qubole DevOps received a recent update from AWS:

10:19 AM PDT We have deployed a fix to the impacted EC2 sub-system causing increased API error rates and new instance launch failures in a Single Availability zone in the US-EAST-1 Region and are beginning to see recovery. We continue to work towards full resolution. Existing instances remain unaffected by this issue.

We have been testing our own internal cluster operations and have seen improvements. We will continue to verify as the issue clears on the AWS side.

Posted Jul 29, 2020 - 10:34 PDT

Identified

Qubole continues to monitor the AWS API error rate issue in the us-east-1 region. At this time, performance in the affected Availability Zone (AZ) is sporadic and inconsistent. AWS reports that existing instances are not affected, so existing clusters are generally operational. For cluster start operations, we recommend the following: if you cannot start your cluster, the cluster startup log will reference the AZ involved. If multiple subnets are configured, you may remove the private subnet for that AZ from your cluster config; otherwise, replace it with a private subnet in a different AZ.
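As a rough illustration of the subnet workaround above, the following Python sketch drops the subnets that belong to the impaired AZ from a list of configured subnets. The subnet IDs, AZ names, and the `subnets_excluding_az` helper are hypothetical. The sketch does not call the Qubole or AWS APIs; the real subnet-to-AZ mapping would come from your own VPC configuration.

```python
from typing import Dict, List


def subnets_excluding_az(subnet_to_az: Dict[str, str], impaired_az: str) -> List[str]:
    """Return the subnet IDs whose AZ is not the impaired one, in input order."""
    return [subnet for subnet, az in subnet_to_az.items() if az != impaired_az]


# Hypothetical example values: substitute the subnets from your cluster config
# and the AZ named in your cluster startup log.
configured = {
    "subnet-aaa111": "us-east-1a",
    "subnet-bbb222": "us-east-1b",
    "subnet-ccc333": "us-east-1c",
}
remaining = subnets_excluding_az(configured, "us-east-1b")
print(remaining)
```

If the resulting list is empty, the cluster has no subnet outside the impaired AZ, and a private subnet in a different AZ would need to be added instead.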

Attempts to upscale or downscale your cluster, including acquiring spot nodes, may run into similar errors. If you are trying to downscale or terminate your cluster, you may need to attempt this multiple times in the Qubole UI or via the API.
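For anyone scripting the retries described above, here is a minimal sketch of a retry loop with exponential backoff. It assumes you already have a function of your own that issues the downscale or terminate request and returns True on success. The `retry_operation` helper and its parameters are hypothetical and not part of any Qubole client.

```python
import time
from typing import Callable


def retry_operation(
    operation: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call operation until it returns True or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts. Returns True on success, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            if operation():
                return True
        except Exception as exc:  # a real client would catch narrower errors
            print(f"attempt {attempt + 1} failed: {exc}")
        # Back off before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False


# Usage (terminate_my_cluster is a placeholder for your own function):
# succeeded = retry_operation(terminate_my_cluster, attempts=5, base_delay=30.0)
```

Spacing the attempts out, rather than retrying immediately, avoids adding load to an API that is already returning errors.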

Our goal is to work with AWS to ensure this issue is resolved as quickly as possible, but as of their 8:23 AM PDT update, there is no definitive ETA. You can follow their updates here: https://status.aws.amazon.com/

Posted Jul 29, 2020 - 09:28 PDT

Investigating

Qubole received notification from AWS on their status page (https://status.aws.amazon.com) as follows:
6:21 AM PDT We have identified the cause of the increased API error rates in a single Availability Zone in the US-EAST-1 Region and continue working towards resolution. Customers experiencing errors launching new EC2 instances may attempt to launch their EC2 instances in another Availability Zone.

Qubole customers may see an impact on their cluster operations as a result.

Posted Jul 29, 2020 - 07:13 PDT

This incident affected: us.qubole.com Environment (AWS) (Command Processing, Cluster Operations).