Accessiblity and performance issue on
Incident Report for Qubole
A large, ad-hoc workload running into unexpected errors drove an ongoing backlog of operations. The lagging operations have either been killed or finished, restoring regular performance. Devops is doing post-mortem monitoring of the environment.
Posted Jun 12, 2021 - 13:00 PDT
Update is currently running slowly due to extremely high throughput, likely complicated by an initial issue with burst throttling (now resolved). Though the number of connections and backlogged operations is consistently coming down, we're seeing delays in the webapp's ability to intake requests and process them. New nodes being added are seeing stability issues in the webapp tier, not accepting traffic and failing.
Posted Jun 11, 2021 - 21:05 PDT
Investigating is currently seeing some degraded performance, and occasionally returning 404 errors during access. At this time failures appear to be partial and intermittent, but Devops is investigating.
Posted Jun 11, 2021 - 12:16 PDT
This incident affected: Environment (AWS) (Site Availability, QDS API, Command Processing, Qubole Scheduler, Cluster Operations).