Inbound workflow issues with telematics enabled

Incident Report for Eleos Technologies

Postmortem

We experienced a partial outage on May 6th between 18:32 UTC and 22:40 UTC and on May 7th from 13:19 UTC to 14:04 UTC, for a total of 4 hours and 53 minutes.  We made an update to our servers that broke error handling for certain error conditions.  Once we rolled back the changes, the outage was resolved.
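As a quick check of the stated total, the two outage windows can be summed directly. This is a minimal sketch; the year 2024 is taken from the posting dates in this report, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# The two outage windows from the summary above, in UTC.
windows = [
    ("2024-05-06 18:32", "2024-05-06 22:40"),
    ("2024-05-07 13:19", "2024-05-07 14:04"),
]

# Sum the length of each window, starting from a zero timedelta.
total = sum(
    (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
     for start, end in windows),
    timedelta(),
)

print(total)  # 4:53:00
```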

The outage delayed the Eleos Platform's ability to process actions and messages that were flagged to include telematics data.  This affected drivers who met all of the following criteria:

  1. The driver sent an inbound message, action, or workflow using a form with enable_telematics_data set to true
  2. The customer environment had the Geotab telematics integration enabled
  3. The driver did not have the telematics integration configured
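The three criteria above can be read as a single predicate over a submission. The following is a hypothetical sketch for illustration only: apart from enable_telematics_data, which is the form setting named in this report, every class, field, and function name is an assumption and does not describe the Eleos Platform's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Form setting named in this report.
    enable_telematics_data: bool
    # Hypothetical field: the customer environment has Geotab enabled.
    account_has_geotab: bool
    # Hypothetical field: the driver has the integration configured.
    driver_has_telematics: bool


def was_affected(s: Submission) -> bool:
    """Return True only when all three criteria hold at once."""
    return (
        s.enable_telematics_data
        and s.account_has_geotab
        and not s.driver_has_telematics
    )


# A driver without telematics configured, on a Geotab-enabled account,
# submitting a telematics-enabled form, matches all three criteria.
print(was_affected(Submission(True, True, False)))  # True
# The same submission from a driver with telematics configured does not.
print(was_affected(Submission(True, True, True)))   # False
```

Because all three conditions must hold together, only a small share of submissions matched.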

During these outages:

  • Driver-submitted actions and messages that included telematics data were delayed until after the outage.  Workflow actions submitted during these times fell back to an offline state if offline workflows were configured.  The apps then synchronized actions and messages after the outage.
  • Drivers with the manage_shipments flag enabled may have failed to retrieve updated load data.

Actions and messages sent using all other forms were unaffected.  The messages and actions that failed were retried by the mobile apps and, after the outage, were processed and transmitted to customer web services.

Platform mobile app users who met the above criteria and were using the system during this time period were affected.  If you and your users were affected by this, we have already reached out with more specific details.

Due to the small number of drivers who met the above configuration criteria, these errors did not occur in sufficient volume to trip our existing alerting mechanisms.  As a result, the errors were not evident to the on-call operator for a relatively long period of time before the cause was identified and the change was rolled back.  To prevent this from happening again, we are improving the integration between our servers and our existing monitoring tools to better surface low-volume errors introduced as part of a deployment.  We're sorry for the impact this had on you and your drivers.

Posted May 20, 2024 - 18:58 UTC

Resolved

We've confirmed the affected functionality is now fixed, so we're marking the incident as resolved.

Specifically, this issue caused an internal error when all of the following were true:

1. Sending an inbound message, action, or workflow using a form with `enable_telematics_data` set to true
2. on an account with a telematics integration enabled, but
3. as a mobile app user who does not have the telematics integration configured

Inbound messages and actions that experience an error are retried, subject to device connectivity. Affected workflow actions will be delivered to your messaging service now that the underlying issue is fixed. Affected mobile apps would have reverted to offline workflow during the incident period from 18:30 UTC until 21:39 UTC.
Posted May 06, 2024 - 22:40 UTC

Update

We're continuing to verify that the functionality is working as expected, although monitoring indicates it's functioning normally. We're also working to isolate the specific configurations that were affected so we can share some additional detail about the scope of the failures. At this time, we believe that actions for users with a telematics integration configured at the account level, but disabled at the user level, were affected.
Posted May 06, 2024 - 22:21 UTC

Monitoring

We've rolled back the associated change and have seen the errors related to retrieving telematics info for inbound workflow actions drop to expected levels.

At this time, we believe workflow actions should be functioning normally, but we are doing additional checks to confirm.
Posted May 06, 2024 - 21:44 UTC

Update

We are continuing to work on a fix for this issue.
Posted May 06, 2024 - 21:26 UTC

Identified

We've identified a correlated change to an area of code responsible for attaching telematics info to inbound messages. We are rolling back the change.
Posted May 06, 2024 - 21:26 UTC

Investigating

We've received some reports of issues, starting around 18:30 UTC, with inbound workflow messages that have telematics info enabled. Our team is investigating these errors to determine the cause. We'll post another update shortly once we have more information.
Posted May 06, 2024 - 21:12 UTC
This incident affected: Eleos Platform (Telematics Integrations).