We experienced a partial outage on May 6th between 18:32 UTC and 22:40 UTC and on May 7th from 13:19 UTC to 14:04 UTC, for a total of 4 hours and 53 minutes. We deployed an update to our servers that broke error handling for certain error conditions. Rolling back the change resolved the outage.
The outage delayed the Eleos Platform's processing of actions and messages flagged to include telematics data. This affected drivers who met all of the following criteria:
enable_telematics_data set to true
During these outages:
manage_shipments flag enabled potentially failed to retrieve updated load data.
Actions and messages sent using all other forms were unaffected. The actions and messages that failed were retried by the mobile apps and, once the outage ended, were processed and transmitted to customer web services.
Platform mobile app users who met the above criteria and were using the system during these time periods were affected. If you or your users were affected, we have already reached out with more specific details.
Because only a small number of drivers met the above configuration criteria, these errors did not occur in sufficient volume to trip our existing alerting mechanisms. As a result, the errors went unnoticed by the on-call operator for a relatively long time before the faulty change was identified and rolled back. To prevent this from happening again, we are improving the integration between our servers and our existing monitoring tools so that low-volume errors introduced by a deployment are surfaced more quickly. We're sorry for the impact this had on you and your drivers.