Degraded performance issue on api.qubole.com

Incident Report for Qubole

Resolved

The degraded performance issue on api.qubole.com is resolved.

Posted Mar 18, 2022 - 04:28 PDT

Update

Thank you for your patience as completed resolution of the issues pertaining to the Qubole API control plane. Customers' scheduled jobs should run successfully at this point. If you are still seeing issues, please communicate those to support.

1. We will also continue to monitor all job queues. Currently all jobs seem to be queuing and ending fine.

Posted Mar 17, 2022 - 22:10 PDT

Update

Thank you for your patience as completed resolution of the issues pertaining to the Qubole API control plane. Customers' scheduled jobs should run successfully at this point. If you are still seeing issues, please communicate those to support.

1.We will continue to monitor the bastion node connectivity.
2.We will also continue to monitor all job queues. Currently all jobs seem to be queueing and ending fine.

Posted Mar 17, 2022 - 16:52 PDT

Update

Posted Mar 17, 2022 - 13:06 PDT

Update

Thank you for your patience as we work to resolve the last remaining issues pertaining to the Qubole API control plane. We have solved all the technical issues except for two points noted in points 1) and 4) below. Most customers' scheduled jobs should run successfully at this point. If you are still seeing issues, you may be one of the customers mentioned in point 1) or you may be experiencing an issue mentioned in point 4). Please communicate with support if you are still facing issues.

1. The "thrift.transport.TTransport.TTransportException” coming from python when attempting to make connection to VPC subnets is still occurring on select accounts. This error seems to be coming due to connectivity issues from the customer side. This is preventing communication to the Bastion nodes. We have verified this for one customer and informed them. We have found 5 more customer instances and are confirming. We are working with these customers to resolve.

2. DevOps has completed clearing jobs stuck in the queue. They will continue to monitor.

3. We continue to monitor all queue jobs. Currently all jobs seem to be queueing and ending fine.

4. The team found a shared tunnel elastic IP which is dangling (not mapped to tunnel server). The Team mapped it to an active tunnel this seems to resolve some of the connectivity issues we were seeing in VPC environments.

Posted Mar 17, 2022 - 09:48 PDT

Update

The team is working on connectivity between SQL and RDS, also checking the tunnels which are configured and failing. Few accounts are still facing encryption errors (intermittent issues ) but these are network connectivity specific issues and team is working on it. Also some tunnel server ips are added on the cluster feature page.

Posted Mar 17, 2022 - 06:58 PDT

Update

Technical team is trying to resolve "thrift.transport.TTransport.TTransportException error associated with tunnels, get_metastore_for_account - Couldn't create encrypted channel to rds and Unable to connect to bastion node. Once these errors resolved, will correct issues with command execution and issues with clusters. In addition, the technical team is monitoring the scheduled jobs.

Posted Mar 17, 2022 - 02:34 PDT

Update

Devops Team have shared some of the most recent findings and ongoing activities to resolve the issue.

There were three issues reported by customers:
Commands were getting stuck from UI
Clusters were not starting
Scheduled jobs were getting stuck

There is a common root cause behind these problems. Investigation from the team suggests that due to recent VPC changes (moving from classic non-vpc to vpc) some of the tunnel configurations have been impacted. Hence encrypted channels are intermittently failing from Qubole's control plane to customers' data plane. Team is working to rectify this.

Regarding scheduled jobs- Team has now cleared all the stuck jobs from last one week and is continuing to monitor the service.

Posted Mar 16, 2022 - 23:08 PDT

Update

Devops Teams have shared some of the most recent findings and ongoing activities to resolve the issue.

1. The suspected cause of api.q outage appears to be moving Scheduler tier from classic (non-vpc) to vpc by DevOps on 8th March without assessing the risk as part of R60 rollout preparation. Since the scheduler stopped working from 8th March, other infra components failed, tunnel servers were affected. The R60 build was also put on api.q. that later led to cross-connection between different conduits. The architects are reviewing all code in case we need revert and/or make configuration changes.
2. DevOps is continuing to clear jobs stuck in the queue. They are doing this incrementally so as not to overwhelm the tunnel servers as they begin to run.
3. We determined tunnels are misconfigured causing performance issues. We are fixing and changing tunnel servers out. This should fix the tunnel issues
4. The issue with python connectivity in VPC environments has been resolved and we are monitoring. We are seeing intermittent connectivity issues from the tunnel servers to the metastore for various customers. We believe that once we finish addressing #3 that this will be resolved.
5. The scheduler does nothing but run the job schedule and does not execute code. The architects are currently comparing the code in the scheduler on API with the code on US, which is working fine, to see what the code differences are.

Posted Mar 16, 2022 - 17:01 PDT

Update

Devops Teams have shared some of the most recent findings and ongoing activities to resolve the issue.

1. The suspected cause of api.q outage appears to be moving Scheduler tier from classic (non-vpc) to vpc by DevOps on 8th March as part of R60 rollout preparation. Since the scheduler stopped working from 8th March, other infra components failed, tunnel servers were affected. The R60 build was also put on api.q. That later led to cross-connection between different conduits. The architects are piecing together any code that was not reverted back and/or any configuration changes that need to be reverted.
2. DevOps is clearing jobs that were stuck in the queue. They are doing this incrementally so as not to overwhelm the tunnel servers as they begin to run.
3. We determined that due to over rotation of tunnels that all tunnels are misconfigured. This is being addressed and should fix the tunnel issues.
4. The connection issue with python is due to connectivity issues with customers using private subnet. They can use a non-VPC or a VPC. It appears to be only VPC connections. We have moved python expertise to triage and solve this issue.
5. The scheduler does nothing but run the job schedule and does not execute code. The architects are currently comparing the code in the scheduler on API with the code on US, which is working fine, to see what the code differences are.

Posted Mar 16, 2022 - 12:16 PDT

Update

The Devops team has cleared all the stuck jobs which were in the submitted state from March 8th Onwards. The Team is currently monitoring the requeued jobs which were under processing. To debug the intermittent job failure issue, team has put loggers on the code which was giving error. Team is also analyzing the RDS logs to check if any configuration change is required.

Posted Mar 16, 2022 - 07:57 PDT

Update

The commands submitted manually are working fine. The Devops team performed the checks on the Tunnels and Nodes.
The team discovered multiple jobs which were stuck in Scheduler, the team is Clearing all the stuck jobs.

Posted Mar 16, 2022 - 04:33 PDT

Update

A few more errors of cross-connection between different conduits got detected. The technical team is debugging further.

Posted Mar 16, 2022 - 00:53 PDT

Update

DevOps team is still working on the root cause of the issue and trying to resolve it soon.

Posted Mar 15, 2022 - 21:12 PDT

Update

Our DevOps found some underlying issue that is causing the failure of the intermittent job and looking to find the root cause.
DevOps team is actively monitoring all queue jobs and the jobs seem to queuing and ending fine.

Posted Mar 15, 2022 - 18:25 PDT

Update

Qubole has been experiencing periodic outages on api.qubole. We are working to resolve this. We have resolved most issues, but if you are still experiencing issues with scheduled jobs not starting and finishing file a ticket with support. Below is a list of what has been done so far as well as our next plan of action to stabilize the platform.

1. Issue: The memcache connectivity from worker nodes was lost
Resolution: Fixed the issues with the configuration and replaced the worker nodes. This fixed the major issues of jobs getting stuck
Status : Done

2. Issue: Some of the worker and discovery tier nodes were trying to connect to VPC based Redis server and were failing .
Resolution: Team fixed this issue by pointing the Redis server DNS to non VPC based Redis and resolved further issues in jobs getting stuck
Status : Done

3. Issue: Common tunnels and dedicated tunnels started giving error due to heavy load built up due to jobs pileup.
Resolution: The common tunnels and dedicated tunnels were replaced with new tunnel machines to ease the traffic.
Status : Done

4. Issue: Chef Run is failing to execute few shell commands in the scheduler node.
Resolution: Trying to re-run the Chef client manually, Issue fixed.
Status : Done

5. Issue: Python connection issue
Resolution: There still seems to be an underlying issue which is causing the intermittent jobs failure.
Status : In-Progress

6. Issue: Cleanup of stuck jobs and clusters
Resolution: Various customers are still seeing intermittent issues. These are due to cleanup needing to be done on a customer basis.
We are waiting for further response from these customers
Status : In-Progress

7. Issue: 100% disc space full
Resolution: We have rotated the worker, client and web app tier nodes which were facing space issues
Status : Done

8. Issue: Cleared jobs stuck in processing since March 7th
Resolution: Changes the status to canceled
Status : Done

9. Issue: Found Python 3.8 errors
Resolution: Found the error in logs
Status : In-progress.

10.Issue: Monitoring all jobs
Resolution: DevOps team is actively monitoring all queued jobs
Status : In-progress.
.
Current Action Items:

-Debug the remaining customer issues and resolve asap
-Monitoring the application health and quickly replace the bad tunnels (temporary measure till we resolve issue #6)
-Working closely with Qubole Support team to quickly address the customer concerns related to specific clusters and critical jobs

Posted Mar 15, 2022 - 15:48 PDT

Update

Posted Mar 15, 2022 - 12:47 PDT

Update

Posted Mar 15, 2022 - 10:49 PDT

Update

Posted Mar 15, 2022 - 07:50 PDT

Update

DevOps Team has identified jobs stuck in processing. Team is working to clear the stuck jobs. Team has rotated the worker, client, and webapp tiers node which were facing some space issues.

Posted Mar 15, 2022 - 04:44 PDT

Update

The support team and the DevOps team are actively working on this and we are very close to resolving this issue. On the back side we are replacing the bad node and trying to fix it as soon as possible.

Posted Mar 15, 2022 - 00:56 PDT