Elevated Error Rates

Incident Report for Eleos Technologies

Postmortem

Between 2023-11-13 23:05 UTC and 2023-11-14 00:01 UTC, a total of 56 minutes, the Eleos Platform failed to process approximately 45% of incoming requests. At 23:14 UTC, our on-call engineers were paged for the elevated error rates and began scaling up the system's processing capacity to stabilize it. This helped temporarily: the error rate started to decline around 23:36 UTC but began increasing again at 23:49 UTC. Our engineers continued to scale up the system. At around 23:57 UTC the failure rate started to decline, and by 2023-11-14 00:01 UTC the error rate had returned to normal.

During the incident window, drivers using the mobile app would have experienced intermittent issues with actions such as logging in, fetching their loads, and sending messages. The app would have essentially functioned as if it were in offline mode or experiencing marginal network conditions. The incident also impacted requests to our public-facing APIs, such as retrieving documents from the document API and sending messages to drivers.

The underlying cause was an unanticipated interaction between the subsystem responsible for recording web service integration errors and overall request processing. The subsystem responsible for dealing with client API errors fell behind on processing errors, which held up other subsystems and resulted in widespread request failures. The responding engineers worked to rectify this problem and saw the error rate drop dramatically.
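
As an illustration of this failure mode only, and not a description of the Eleos implementation, the sketch below shows one common way to keep error recording from holding up request processing: a bounded queue that drops new entries when it is full instead of blocking the caller. The `ErrorRecorder` class, its method names, and the `max_pending` limit are all hypothetical.

```python
import queue
from typing import List


class ErrorRecorder:
    """Hypothetical error recorder that never blocks the request path."""

    def __init__(self, max_pending: int = 1000) -> None:
        # Bounded queue: once max_pending entries are waiting, new ones are dropped.
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def record(self, message: str) -> bool:
        """Queue an error for later processing; return False if it was dropped."""
        try:
            self._pending.put_nowait(message)
            return True
        except queue.Full:
            # Losing one error record is preferred over stalling a request.
            self.dropped += 1
            return False

    def drain(self, limit: int) -> List[str]:
        """Remove and return up to `limit` queued errors, oldest first."""
        drained: List[str] = []
        while len(drained) < limit:
            try:
                drained.append(self._pending.get_nowait())
            except queue.Empty:
                break
        return drained
```

The trade-off in this sketch is deliberate: when the recorder falls behind, some error records are lost, but callers return immediately, and the `dropped` counter gives a signal that can be monitored.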

While the responding engineers worked to relieve this backlog, the Platform servers were using less processing power because far fewer requests were being processed normally. This caused the automatic scaling process to reduce the number of running servers, which further exacerbated the problem. The engineers immediately intervened and forced the automatic scaling process to scale up instead of down.
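
The scale-down behavior described above can be sketched as a scaling decision that looks only at CPU utilization, together with one possible guard: refusing to remove servers while the error rate is elevated. This is a minimal sketch under assumed thresholds. The `Metrics` type, the `desired_servers` function, and every threshold value are hypothetical and do not reflect Eleos's actual scaling configuration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_utilization: float  # fraction of capacity in use, 0.0 to 1.0
    error_rate: float       # fraction of requests failing, 0.0 to 1.0


def desired_servers(
    current: int,
    metrics: Metrics,
    *,
    low_cpu: float = 0.30,         # assumed scale-in threshold
    high_cpu: float = 0.70,        # assumed scale-out threshold
    max_error_rate: float = 0.05,  # assumed limit above which scale-in is blocked
    min_servers: int = 2,
) -> int:
    """Return the target server count for one scaling evaluation."""
    if metrics.cpu_utilization > high_cpu:
        return current + 1
    if metrics.cpu_utilization < low_cpu:
        # Low CPU during an outage usually means requests are failing early,
        # not that demand has dropped, so hold capacity steady.
        if metrics.error_rate > max_error_rate:
            return current
        return max(min_servers, current - 1)
    return current
```

With these assumed thresholds, `desired_servers(10, Metrics(cpu_utilization=0.10, error_rate=0.45))` keeps all 10 servers, whereas the same low CPU reading with a healthy error rate would allow the count to drop to 9.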

Eleos has identified and confirmed the underlying issue through logging, application tracing, recorded metrics, and new testing methods designed to surface unexpected interactions like this one. To prevent this from happening again, we are working to implement and release a complete fix. Until that work is complete, mitigations are in place to prevent a recurrence.

Posted Nov 20, 2023 - 17:19 UTC

Resolved

Extra capacity has been provisioned, and error rates and response times are normal.

Posted Nov 14, 2023 - 00:51 UTC

Update

Error rates and response times are normal. We have provisioned extra capacity and are investigating the contributing factors to the incident.

Posted Nov 14, 2023 - 00:32 UTC

Monitoring

Error rates are below 5% and response times are approaching normal as we continue monitoring. Apps will now exit offline mode and resend messages as they come back online.

Posted Nov 14, 2023 - 00:12 UTC

Identified

We've identified a problem with our database capacity and we're working on provisioning more connections to compensate. Error rates are beginning to trend downwards.

Posted Nov 13, 2023 - 23:59 UTC

Investigating

We are experiencing elevated error rates and are currently investigating. Apps will fall back into offline mode and will retry sending messages.

Posted Nov 13, 2023 - 23:39 UTC

This incident affected: Eleos Platform (API, Dashboard, Mobile Apps, Telematics Integrations, Document delivery).