Azure Issue in South Africa
Resolved
Jan 27 at 12:03pm UTC
Initial RFO from Microsoft
What happened?
Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully mitigated by 12:43 UTC. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud.
What went wrong and why?
We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed.
How did we respond?
Our monitoring initially detected DNS and WAN related issues from 07:12 UTC. We began investigating by reviewing all recent changes. By 08:10 UTC, the network started to recover automatically. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issues. Networking telemetry shows that nearly all network devices had recovered by 09:00 UTC, by which point the vast majority of regions and services had recovered. Final networking equipment recovered by 09:35 UTC.
Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.
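For readers reconstructing the timeline, the timestamps quoted in the RFO above can be converted into offsets from the start of impact. This is a minimal sketch, not part of Microsoft's report: the milestone labels and the helper names `utc` and `minutes_since_start` are hypothetical, and only the times themselves come from the text.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM string as a UTC time on 25 January 2023 (the incident date)."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 1, 25, hour, minute, tzinfo=timezone.utc)


# Times are taken from the RFO text; the labels are our own shorthand.
MILESTONES = [
    ("impact start", "07:05"),
    ("monitoring detected issues", "07:12"),
    ("automatic recovery began", "08:10"),
    ("problematic command identified", "08:20"),
    ("nearly all devices recovered", "09:00"),
    ("final equipment recovered", "09:35"),
    ("fully mitigated", "12:43"),
]


def minutes_since_start(milestones):
    """Return {label: whole minutes elapsed since the first milestone}."""
    start = utc(milestones[0][1])
    return {
        label: int((utc(hhmm) - start).total_seconds() // 60)
        for label, hhmm in milestones
    }


if __name__ == "__main__":
    for label, minutes in minutes_since_start(MILESTONES).items():
        print(f"{label}: +{minutes} min")
```

On these figures, the total impact window is 338 minutes (5 hours 38 minutes), of which the last 188 minutes were the packet-loss tail while the WAN health systems were paused.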
How are we making incidents like this less likely or less impactful?
- We have blocked highly impactful commands from getting executed on the devices (Completed)
- We will require all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023)
This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 10:42am UTC
We have seen incoming SIP UDP traffic successfully terminate in South Africa regions. We will monitor before removing the current redirection of SIP traffic via other regions.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 09:39am UTC
We are seeing issues with SIP requests sent over UDP.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 08:55am UTC
Update from Microsoft - they now confirm the issue started at 07:05 UTC (09:05 SAST). Our automated monitoring first picked up issues at 07:12 UTC (09:12 SAST).
Azure Networking - Multiple regions - Investigating
Starting at 07:05 UTC on 25 January 2023, customers may experience issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in multiple regions, as well as other Microsoft services. We are actively investigating and will share updates as soon as more is known.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 08:34am UTC
Azure have officially confirmed the issue - we await further information and an RFO.
Azure Networking - Multiple regions - Investigating
Starting at 07:30 UTC, we're aware of a networking issue impacting connectivity to Azure for a subset of users. We are actively investigating and will share updates as soon as more is known.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 08:21am UTC
We continue to see stabilised connectivity. We will keep monitoring but service appears to be back to normal.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 08:11am UTC
We are seeing latency return to normal levels. Monitoring.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 08:08am UTC
We are seeing stabilisation of domestic connectivity within South Africa, albeit at elevated latency. Packet loss is currently near zero, but latency is around 200ms higher than usual.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 07:51am UTC
Where previously we saw no connectivity drops between other regions (for example UK South -> ZA North), we are now seeing intermittent connectivity issues within the Azure global network too.
Azure have opened an emerging issue - we do not yet know whether it is linked to the issues we are seeing in ZA North + ZA West. We have also raised a critical fault with our Azure representatives.
Affected services
Voice - ZA North (JNB)
Updated
Jan 25 at 07:42am UTC
We are seeing latency that indicates Azure is re-routing traffic outside of South Africa. Local destinations that are normally <3ms from JNB are taking 280ms+, with 60%+ packet loss.
Affected services
Voice - ZA North (JNB)
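The 07:42 update infers re-routing from two measurements: round-trip time far above the local baseline (280ms+ against a normal <3ms) and 60%+ packet loss. A minimal sketch of that reasoning follows, assuming probe results have already been summarised as sent/received counts and an average RTT. The `ProbeSummary` type, the `classify` function, and the 5% loss and 10x latency thresholds are illustrative assumptions, not the monitoring actually used during the incident.

```python
from dataclasses import dataclass


@dataclass
class ProbeSummary:
    """Hypothetical summary of one probe run against a single destination."""
    sent: int
    received: int
    avg_rtt_ms: float


def packet_loss_pct(probe: ProbeSummary) -> float:
    """Percentage of probes that received no reply."""
    if probe.sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (probe.sent - probe.received) / probe.sent


def classify(
    probe: ProbeSummary,
    baseline_rtt_ms: float = 3.0,     # "normally <3ms from JNB" per the update
    loss_threshold_pct: float = 5.0,  # assumed alerting threshold
    reroute_factor: float = 10.0,     # assumed multiple of baseline RTT
) -> set:
    """Return the set of symptoms the probe summary exhibits."""
    flags = set()
    if packet_loss_pct(probe) >= loss_threshold_pct:
        flags.add("packet_loss")
    # An RTT many times the local baseline suggests traffic is leaving the
    # country and coming back, rather than taking a local path.
    if probe.avg_rtt_ms >= baseline_rtt_ms * reroute_factor:
        flags.add("possible_reroute")
    return flags


if __name__ == "__main__":
    during_incident = ProbeSummary(sent=100, received=40, avg_rtt_ms=280.0)
    print(packet_loss_pct(during_incident), classify(during_incident))
```

The two symptoms are kept separate on purpose: the 08:08 update describes a state with near-zero packet loss but latency still around 200ms above normal, which this sketch would flag as a possible re-route only.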
Updated
Jan 25 at 07:31am UTC
We are currently seeing 60%+ packet loss from Azure to local South African networks from both South African regions (JNB + CPT).
Affected services
Voice - ZA North (JNB)
Created
Jan 25 at 07:24am UTC
We are seeing issues in Azure.
Affected services
Voice - ZA North (JNB)