About 10 days ago, we started to experience degraded performance in the US West region of The Things Network public community network. We understand the impact on the community using this region, especially during a global health crisis, when long-range wireless communication can be more important than before. We received the reports from the community, and we take our operations seriously, even though we do not and cannot provide a (commercial) service level agreement on the community network.
As with many complex issues, there was no single root cause that we could identify and mitigate. We could, however, relate most issues to connectivity problems in the transport layer, which were not detected or handled correctly by our services. We think that the unreliability of the transport layer is partly caused by the colossal increase of load on Microsoft Azure (see https://azure.microsoft.com/en-us/blog/update-2-on-microsoft-cloud-services-continuity/) in the US West region. Also, we made some changes in how we handle traffic for multiple independent targets.
This is what we did:
Even though the actual patches are rather small, sometimes literally fewer than 10 lines of code, analyzing current behavior, reproducing it locally, testing fixes and deploying new versions is a very time-consuming process. It may have seemed otherwise, partly because of the time zone difference between this region (US West) and our operations team (Europe), and partly because of our sparse communication, but we spent a lot of time and effort on fixing the performance degradation of The Things Network in this region. We continue to do so, as we take the operations of our community network very seriously.
I hope you all stay safe and healthy in the coming months.