Intermittent downtime in US-West region

Incident Report for The Things Network

Postmortem

About 10 days ago, we started to experience degraded performance in the US West region of The Things Network public community network. We understand the impact on the community using this region, especially during a global health crisis, when long-range wireless communication can be more important than before. We received reports of these issues from the community, and we take our operations seriously, even though we do not and cannot provide a (commercial) service level agreement on the community network.

As with many complex issues, there was no single root cause that we could identify and mitigate. We could, however, relate most issues to connectivity problems in the transport layer, which were not detected or handled correctly by our services. We think that the unreliability of the transport layer is partly caused by the colossal increase of load on Microsoft Azure (see https://azure.microsoft.com/en-us/blog/update-2-on-microsoft-cloud-services-continuity/) in the US West region. Also, we made some changes in how we handle traffic for multiple independent targets.

This is what we did:

Improved detection of broken connections by setting gRPC keep-alive parameters;
Made the fan-out of traffic to multiple independent targets (i.e. from V3 Gateway Server to V2 Router and to Packet Broker) truly independent, so that one target can no longer block the other;
Increased the reliability and resilience of the V3 Gateway Server to avoid unexpected restarts.
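
To illustrate the second point, here is a minimal sketch in Go of a fan-out in which a slow or stalled target cannot hold up the others. It is not the actual Gateway Server code: the names FanOut, AddTarget, Publish and Close, the per-target queue size, and the choice to drop messages when a queue is full are all assumptions made for this example.

```go
package main

import "sync"

// target is a hypothetical downstream consumer, for example a V2 Router
// or Packet Broker client. Each target gets its own bounded queue.
type target struct {
	name  string
	queue chan string
}

// FanOut forwards every message to all registered targets independently.
// Register all targets before the first Publish; the type is not meant
// for concurrent AddTarget and Publish calls.
type FanOut struct {
	targets []*target
	workers sync.WaitGroup
}

// NewFanOut returns an empty fan-out.
func NewFanOut() *FanOut { return &FanOut{} }

// AddTarget registers a target with a queue of the given size and starts
// a worker goroutine that calls handle for each queued message.
func (f *FanOut) AddTarget(name string, buffer int, handle func(msg string)) {
	t := &target{name: name, queue: make(chan string, buffer)}
	f.targets = append(f.targets, t)
	f.workers.Add(1)
	go func() {
		defer f.workers.Done()
		for msg := range t.queue {
			handle(msg)
		}
	}()
}

// Publish offers msg to every target without blocking. If a target's
// queue is full, the message is dropped for that target only, and the
// target's name is included in the returned slice.
func (f *FanOut) Publish(msg string) (dropped []string) {
	for _, t := range f.targets {
		select {
		case t.queue <- msg:
		default:
			dropped = append(dropped, t.name)
		}
	}
	return dropped
}

// Close closes all queues and waits for the workers to drain them.
// Publish must not be called after Close.
func (f *FanOut) Close() {
	for _, t := range f.targets {
		close(t.queue)
	}
	f.workers.Wait()
}
```

With this structure, a target whose handler hangs on a broken connection only fills up its own queue, while traffic to the other targets keeps flowing. Whether to drop, retry or apply a timeout when a queue is full is a trade-off that depends on the target.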

Even though the actual patches are rather small, sometimes literally fewer than 10 lines of code, analyzing current behavior, reproducing it locally, testing fixes and deploying new versions is a very time-consuming process. Although it may have seemed otherwise, partly because of the time zone difference between this region (US West) and the location of our operations team (Europe), and partly because of our sparse communication, we spent a lot of time and effort on fixing the performance degradation of The Things Network in this region. We continue to do so, as we take the operations of our community network very seriously.

I hope you all stay safe and healthy in the coming months.

Posted Apr 03, 2020 - 10:10 CEST

Resolved

Our US-West region has been stable since yesterday's changes, so this incident will now be closed.

Posted Apr 02, 2020 - 09:24 CEST

Update

We have identified a new possible cause for the issues in the US-West region and new mitigation measures have been implemented. We will continue to monitor the situation and take further action when needed.

Posted Apr 01, 2020 - 11:44 CEST

Monitoring

We finished a full re-deployment of our US-West region that we started yesterday afternoon (UTC). We hope that our new servers are more stable than the old ones. We will monitor the situation and take further action when needed.

Posted Mar 31, 2020 - 11:03 CEST

Investigating

Even with the mitigations in place, we still experience frequent outages in our US-West region. We are still investigating, but have so far not been able to pinpoint the root cause of the issues.

Posted Mar 27, 2020 - 10:04 CET

Identified

We have found a possible cause for the issues in our US-West region and have put measures in place to mitigate the impact.

Posted Mar 23, 2020 - 17:51 CET

Investigating

We are currently investigating reports of issues with our US-West region.

Posted Mar 23, 2020 - 14:18 CET