Incident Report: January 29, 2019
The following is the incident report for a partial FormAssembly outage that occurred on January 29, 2019, and affected Enterprise and Compliance Cloud customers. No Professional or Premier customers were affected. No data was lost during this incident, though users were unable to access forms. We understand this service issue has impacted our valued Enterprise and Compliance Cloud clients, and we apologize to everyone who was affected.
The outage occurred as two separate incidents: the first from 3:47 p.m. to 4:30 p.m. EST, and the second from 5:00 p.m. to 6:40 p.m. EST. Not every Enterprise client was affected; the first incident impacted 50% of Enterprise clients and the second impacted 15% of Enterprise clients.
From 3:47 p.m. to 6:40 p.m. EST, attempts to access FormAssembly resulted in errors, and users were redirected to an error page. The issue affected a portion of Enterprise clients. It was caused by a database upgrade run as part of a release process, which triggered a partial database lockout.
Timeline (all times U.S. Eastern Standard Time, GMT-5)
3:30 p.m.: Release process began
3:47 p.m.: Outage began
3:48 p.m.: Alerts went out to teams
3:52 p.m.: Problem identified
4:15 p.m.: Execution of database scripts stopped
4:15 p.m.: Scripts were executed manually on remaining databases
4:25 p.m.: All servers and instances were stable
4:30 p.m.: StatusPage was updated, issue resolved
4:55 p.m.: Reports of sporadic access errors identified
5:00 p.m.: StatusPage updated with partial outage
5:03 p.m.: Problem identified
5:05 p.m.: Some of the database scripts had timed out with completed status
5:07 p.m.: Database scripts were rerun manually on each of the uncompleted databases
6:40 p.m.: All scripts were run and confirmed, and all client systems were functional
At 3:30 p.m. EST, deployment of new release code started for EC/CC clients. Non-intrusive database upgrades are part of the upgrade process; these are individual scripts that run on each database and take around 1 to 3 seconds to complete. These scripts run in parallel and normally finish quickly. With this release, however, at 3:47 p.m. a few of the databases on the server became blocked by exclusive locks, and server resource usage spiked. This spike caused the database server to switch to the second node in the cluster. The problem was identified, and the automated jobs running the scripts were stopped. Scripts were then run manually for the remaining databases. At this point, all systems were functional based on spot checks and our monitoring system.
At 4:55 p.m. we started getting reports of sporadic access issues from a few clients as well as internal monitoring alerts. Some of the databases failed to upgrade during the automated scripts even though their status was reported as complete. We generated a report for such instances and started running database scripts manually to apply intended changes.
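As an illustration only, the kind of report described above can be sketched as a comparison between the status each automated job reported and the schema version actually present in each database. This is a minimal sketch, not our production tooling: the `UpgradeRecord` type, the `find_incomplete` function, the use of a numeric schema version, and the database names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeRecord:
    """Hypothetical record for one client database after a release."""
    database: str
    reported_status: str  # status logged by the automated job
    schema_version: int   # version actually found in the database


def find_incomplete(records, target_version):
    """Return the databases whose schema is behind the target version.

    The reported status is deliberately ignored: a script that timed out
    may still have been logged as complete.
    """
    return sorted(r.database for r in records if r.schema_version < target_version)


# Hypothetical sample data for illustration.
records = [
    UpgradeRecord("client_a", "complete", 42),
    UpgradeRecord("client_b", "complete", 41),  # logged complete, not upgraded
    UpgradeRecord("client_c", "failed", 41),
]
needs_rerun = find_incomplete(records, target_version=42)
print(needs_rerun)
```

Checking the actual state of each database, rather than the logged status, is what allows this kind of report to catch scripts that timed out but were marked complete.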
Resolution and Recovery
At 3:48 p.m., the monitoring and alerting systems alerted our team, who investigated and quickly escalated the issue. At 3:52 p.m., the incident response team identified the issue as related to database scripts and a database server resource spike.
At 4:15 p.m., we were able to stop the automated execution of database upgrade scripts, and we also started applying database upgrade scripts to the remaining databases.
All services were restored at 4:25 p.m., after which we monitored for 5 minutes and scanned the logs.
At 4:30 p.m., StatusPage was updated with a resolved status.
At 4:55 p.m., we received reports of sporadic access errors from a few clients.
We identified the issue as being left over from automated scripts. We were able to quickly run a report and fix the affected databases.
All systems were restored and functional by 6:40 p.m.
Corrective and Preventative Measures
In the past 18 hours, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue and to help prevent recurrence.
- Enhance current database upgrade scripts release management to prevent resource locks. (Completed)
- Lower database server alarm limit from 80% to 60%. (Completed)
- Throttle parallel database deployment scripts dynamically based on server load. (Completed)
- Make database changes linear and backwards compatible. (Process documented for future releases)
- Implement accurate status code and error catching for database scripts. (Process documented for future releases)
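To make the throttling and status-reporting measures above more concrete, here is a minimal sketch of a parallel script runner that pauses new submissions while server load is high and records a failure for any script that raises an error. It is an illustration under stated assumptions, not our actual implementation: `run_throttled`, `run_script`, and `get_load` are hypothetical names, and reusing the 60% alarm limit as the throttle threshold is our assumption for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_throttled(databases, run_script, get_load,
                  max_load=0.60, max_workers=4, poll_seconds=0.0):
    """Run run_script(db) for each database, throttled by server load.

    get_load() is a hypothetical callable returning load as a fraction
    (0.0 to 1.0). New scripts are not submitted while load exceeds max_load.
    Returns a dict mapping each database to "complete" or "failed: <reason>".
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for db in databases:
            # Hold back new work until the server has headroom.
            while get_load() > max_load:
                time.sleep(poll_seconds)
            futures[pool.submit(run_script, db)] = db
        for fut in as_completed(futures):
            db = futures[fut]
            try:
                fut.result()  # re-raises any error from the script
                results[db] = "complete"
            except Exception as exc:
                # A timeout or error is never recorded as complete.
                results[db] = f"failed: {exc}"
    return results


def fake_script(db):
    """Stand-in for a real upgrade script; one database times out."""
    if db == "client_b":
        raise TimeoutError("script timed out")


# Simulated load readings: one reading above the limit, then headroom.
loads = iter([0.9, 0.5, 0.5, 0.5])
status = run_throttled(["client_a", "client_b", "client_c"],
                       fake_script, lambda: next(loads))
print(status)
```

In this sketch a database is marked complete only when its script returns without raising, so a timed-out script surfaces as a failure that can be rerun.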
FormAssembly is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.
– The FormAssembly Incident Response Team