On Monday, December 2, Incident IQ experienced a service interruption that degraded services for several customers.
Timeline of events:
- 9:39 AM ET - Began receiving customer reports of slowness and degraded service on portions of the platform, including login authentication.
- 11:41 AM ET - All services and platform responsiveness levels restored.
Cause of Incident:
- As part of normal operating procedures, Incident IQ scales platform capacity to meet both current customer demand and predicted demand.
- Incident IQ uses the Microsoft Azure platform to scale infrastructure automatically. On Monday, December 2, the scaling process encountered errors and took five times longer than normal to complete.
- The outage was caused by the combination of infrastructure not scaling on its normal schedule and a load larger than the available infrastructure could handle.
Remediation:
- Services were ultimately restored by a combination of manually adding infrastructure resources and the completion of the automated scaling job.
- Incident IQ is working alongside Microsoft and has identified that an unforeseeable, abnormally high volume of transactions on the SQL environment during scaling caused the process to run longer.
- We are continuing to work with Microsoft to prevent a recurrence of this issue.
The reliability of our platform remains of the utmost importance to us. We understand the impact these moments have on our customers. The remediations put in place to prevent a recurrence of this particular incident, together with the processes we have in place to continuously improve the platform, give us confidence that we can stay ahead of unexpected surges in traffic.
As before, we sincerely apologize for this disruption and thank you for your patience and partnership as we worked through this issue.