Degraded performance issue on api.qubole.com

Incident Report for Qubole

Resolved

The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.

Posted May 12, 2022 - 00:50 PDT

Monitoring

The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.

Posted May 11, 2022 - 20:02 PDT

Update

We are still proceeding with the plan as outlined and are on track to complete it by 12:00 CST. We will post an update here if anything changes.

Posted May 11, 2022 - 18:06 PDT

Update

After further investigation with AWS support, we have a new update and plan:
1. While working with AWS this afternoon, DevOps determined that a table reached the MySQL 2TB limit. This is a system table, so we cannot delete its data.
2. The cause is that multiple tables write to the same data file. Good practice would have been a separate data file for each table, which was not the case.
3. To fix this, the team will:
-Back up a handful of tables whose data will be moved into their own data files.
-Drop those tables and recreate them with their own data files.
-Restore the data to those tables, which moves it into the new data files and out of the data file that hit the 2TB limit, freeing space.
4. This should defragment the database and free up space while shrinking the data file that is running into the limit.
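
The backup, drop, and restore steps above can be sketched as an ordered plan. This is a minimal illustration only, not the commands the team ran: the schema name, table names, and dump file names are hypothetical, and it assumes per-table data files are already enabled so that the restored tables are created in their own files.

```python
from typing import List, Tuple


def plan_table_move(schema: str, tables: List[str]) -> List[Tuple[str, str]]:
    """Build an ordered list of (phase, command) steps.

    All backups come first, so no table is dropped before every
    table in the batch has a dump to restore from.
    """
    steps: List[Tuple[str, str]] = []
    for t in tables:
        # Dump one table to a file (hypothetical file name).
        steps.append(("backup", f"mysqldump --single-transaction {schema} {t} > {t}.sql"))
    for t in tables:
        # Remove the table from the shared data file.
        steps.append(("drop", f"DROP TABLE `{schema}`.`{t}`;"))
    for t in tables:
        # Reloading the dump recreates the table and its rows.
        steps.append(("restore", f"mysql {schema} < {t}.sql"))
    return steps


if __name__ == "__main__":
    # Hypothetical schema and table names, for illustration only.
    for phase, command in plan_table_move("rstore", ["commands", "logs"]):
        print(f"{phase:8s} {command}")
```

Keeping the phases separate makes it easy to stop after the backup phase and verify the dump files before anything is dropped.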

This is a temporary measure to get back up and running. Testing and implementation should take about the next 8 hours, depending on the data load. We estimate that the work will be complete and the service back up by 12:00 CST. The long-term solution is to rebuild the entire database. That can be done offline, with a cutover once it is ready, so no downtime would be involved. We have done similar updates in the other regions with no customer impact or downtime.

Posted May 11, 2022 - 15:10 PDT

Update

As per the last update, we are still in the process of moving the data.

Posted May 11, 2022 - 13:04 PDT

Update

Latest Update:

What caused the outage

* The rStore database had a table that filled up and also caused the disk space to fill up, which caused the database to stop responding.
* Customers are not able to run jobs because of the unresponsive rStore database.

What has been done to resolve so far

* Increased memory and storage on the instance.
* Cleared the table, but the disk space was not reclaimed and the disk is still full.
* Engaged AWS and determined that we cannot set the parameter for the table to autoscale, because it has to be set at creation.
* Created a new instance from the old database, with increased storage and memory.
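
One way to picture why clearing the table did not free disk space is the toy model below. It assumes the table lived in a data file shared with other tables, and that this file is extended as data grows but is not shrunk when rows are removed. The class, table names, and sizes are hypothetical, and the model does not reflect the database's real page allocator.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SharedDataFile:
    """Toy model: a shared data file that can grow but never shrinks."""

    limit_gb: int
    file_size_gb: int = 0
    tables: Dict[str, int] = field(default_factory=dict)

    @property
    def used_gb(self) -> int:
        # Space actually occupied by live table data.
        return sum(self.tables.values())

    def write(self, table: str, gb: int) -> None:
        needed = self.used_gb + gb
        if needed > self.file_size_gb:
            if needed > self.limit_gb:
                raise RuntimeError("data file size limit reached")
            # Extend the file; free space inside it is reused first.
            self.file_size_gb = needed
        self.tables[table] = self.tables.get(table, 0) + gb

    def clear(self, table: str) -> None:
        # Rows are gone, but the file on disk keeps its size.
        self.tables[table] = 0


if __name__ == "__main__":
    f = SharedDataFile(limit_gb=2048)
    f.write("big_table", 1500)
    f.write("other_table", 500)
    f.clear("big_table")
    print("live data (GB):", f.used_gb)
    print("file on disk (GB):", f.file_size_gb)
```

In this model the live data drops after the clear while the file on disk stays the same size, which matches the behavior described in the list above.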

What’s Next

* The new MySQL database is in place, and setup is complete.
* Export of data from the prior DB to S3 is in progress.
* Import the data from the prior instance into the new instance.

The estimated time to complete the data load is 24 hours, due to the size of the MySQL database (1TB+). We are working with AWS to identify any methods to decrease the data load time. We will provide updates here if there is any change to the timeline.
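
As a rough back-of-the-envelope check on that estimate, the sketch below converts a data size and a time window into a sustained transfer rate. It assumes decimal units (1 TB = 1,000,000 MB) and a steady rate with no pauses; the function names are illustrative. Because the database is "1TB+", the result for 1 TB is only a lower bound: about 11.6 MB/s sustained over 24 hours.

```python
def required_throughput_mb_s(size_tb: float, hours: float) -> float:
    """Sustained rate (MB/s) needed to move size_tb within the given hours."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    return size_mb / (hours * 3600)


def load_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to move size_tb at a steady throughput in MB/s."""
    size_mb = size_tb * 1_000_000
    return size_mb / throughput_mb_s / 3600


if __name__ == "__main__":
    rate = required_throughput_mb_s(1.0, 24.0)
    print(f"1 TB in 24 h needs about {rate:.1f} MB/s")
    # Doubling the sustained rate halves the load time.
    print(f"At twice that rate: {load_hours(1.0, 2 * rate):.1f} h")
```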

Posted May 11, 2022 - 08:08 PDT

Update

-The issue is still under investigation.
-The current RDS DB (MySQL) instance is running a deprecated version (5.6.39), and the tablespace still appears full even after applying innodb_file_per_table=1.
-The team is currently working to migrate the environment, along with the DB, to a supported version of MySQL.

We are continuing to investigate and will update accordingly.

Posted May 11, 2022 - 03:13 PDT

Update

Latest updates:
-Cleared the storage issues and the low memory on the longer-running tunnels.
-Updated the RDS memory from 5000 GB to 5500 GB on the production rStore RDS instance as well as the production rStore replica. This takes about 6 hours according to Amazon's documentation. We started at about 5PM CST, so the updated instance with the added memory should be up and running around 11PM CST.

After taking steps to free up storage, the issue persists and the storage is not being released. We are continuing to investigate and will update accordingly.

Posted May 10, 2022 - 20:50 PDT

Update

We continue to work on clearing resources and expanding the limits in the rStore database. We should have an ETA shortly.

Posted May 10, 2022 - 18:15 PDT

Identified

We have identified a full table in the rStore database that appears to be causing the issue. We are in the process of clearing that condition.

Posted May 10, 2022 - 14:37 PDT

Investigating

Several customers are experiencing issues when scheduling jobs. We are looking into the matter and will update shortly.

Posted May 10, 2022 - 12:39 PDT

This incident affected: Command Processing, Cluster Operations, and Notebooks.