Software Upgrade Interruptions: The New Challenge for Resilience

AEIdeas

September 26, 2025

Once again, a software upgrade gone wrong has brought core communications capacities to their knees, this time with fatal consequences. On September 18, a network software update at Australian telecommunications provider Optus brought the Triple Zero emergency services calling system in the Northern Territory, South Australia, and Western Australia to a halt for around 13 hours. At least three deaths, including that of an eight-week-old baby, have been attributed to the outage. With more than 600 calls failing to connect, numerous other, less grievous harms have inevitably followed. Not least among them is the loss of confidence in the emergency calling system in general and in Optus in particular.

The Triple Zero catastrophe comes nearly two years after another spectacular Optus upgrade failure, in which an upgrade triggered a Border Gateway Protocol routing problem that shut down the entire Optus network once default tolerance thresholds set by Cisco Systems were exceeded. A failed network upgrade similarly crippled Canada’s Rogers network in July 2022. The 2023 Optus outage cost the company over $1.2 million, likely only a fraction of the cost to the economy from lost service. The share price fell by around 5 percent, and the CEO was forced to resign.
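To make that failure mode concrete, the sketch below illustrates, in simplified form, how a route-advertisement safety threshold of the kind described above can turn a routine update into a network-wide outage. The numbers and function names are hypothetical, and the logic is a generic illustration of prefix-limit protection, not Cisco’s actual implementation or the Optus configuration.

```python
# Simplified, hypothetical illustration of a safety threshold of the kind
# described above: if a router receives more route advertisements than its
# configured maximum, it tears down the session to protect itself. This is
# a sketch of the general mechanism, not Cisco's implementation or the
# Optus configuration; the limit below is an assumed figure.

MAX_PREFIXES = 500_000  # assumed default limit, for illustration only


def handle_route_update(current_prefix_count: int, new_prefixes: int) -> str:
    """Accept routes unless the configured prefix limit would be exceeded."""
    total = current_prefix_count + new_prefixes
    if total > MAX_PREFIXES:
        # Exceeding the limit is treated as a fault: the session is shut
        # down, which is safe for one router in isolation but can cascade
        # when many routers hit the limit at once.
        return "session torn down"
    return "routes accepted"


print(handle_route_update(480_000, 50_000))  # -> session torn down
```

The protective behavior is sensible for a single device, but when many routers trip the same limit simultaneously, the sessions they tear down can take the whole network with them.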

The exact cause of the most recent Optus outage has yet to be confirmed. Given the severity of the consequences, the Australian Communications and Media Authority has launched a formal investigation, but informal reports point to a departure from established processes.

Such outages are not confined to telecommunications systems. The July 2024 CrowdStrike outage, in which a faulty update crashed millions of Microsoft Windows machines worldwide and crippled air and ground transportation as well as the finance, health care, media, and retail industries, likewise stemmed from a software upgrade gone wrong. Nearly every day, a software upgrade goes wrong somewhere, frustrating the core functionality of essential and nonessential operations alike: in hospitals, schools, universities, government departments, stores, factories, and warehouses. Each individual outage may seem minor, but collectively the lost profits, lost productivity, and other intangible consequences add up to a substantial aggregate cost.

The growing cost of upgrade outages derives from three interwoven sources. First, increasing digitization means that applications entirely reliant on computational capacity handle more of our daily activities. Second, as centrally managed cloud-based data storage and application hosting replace local storage and processing on phones, local servers, and computers, functions that once depended on a small number of locally managed steps now depend on long chains of links covering both the movement of data and its processing. The more links data must traverse from origin to destination and completed processing, the more pieces of software handle it and, in turn, the more potential points of failure it is exposed to. Third, the software processing the data is itself growing more complex, as ever more intricate systems interact to manage and control the relevant operations. Seamless coordination of these software-managed processes is essential to ongoing operational performance. An upgrade to any one piece of software can uncover an unexpected incompatibility and break the chain, simply because there are more possible permutations and combinations of functions than either human or system can reliably track, as the rough arithmetic sketched below illustrates.
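The following sketch puts illustrative numbers on that reasoning. All figures are assumed for the purpose of the example (a hypothetical 99.9 percent per-link reliability and hypothetical chain lengths); the point is the shape of the trend, not the specific values: end-to-end reliability decays as the chain of software links lengthens, while the number of interactions an upgrade might disturb grows combinatorially.

```python
# Illustrative sketch with hypothetical numbers (not drawn from Optus,
# Rogers, or CrowdStrike data). It quantifies two effects described above:
# longer chains of software links lower end-to-end reliability, and the
# number of component interactions an upgrade could disturb grows
# combinatorially with the number of components.

from math import comb


def chain_reliability(per_link_reliability: float, links: int) -> float:
    """Probability a request traverses every link without failure,
    assuming (simplistically) that links fail independently."""
    return per_link_reliability ** links


def pairwise_interactions(components: int) -> int:
    """Number of distinct component pairs an upgrade might affect."""
    return comb(components, 2)


for links in (3, 10, 30):
    print(
        f"{links:>2} links at 99.9% each -> "
        f"{chain_reliability(0.999, links):.4f} end-to-end reliability, "
        f"{pairwise_interactions(links):>3} pairwise interactions to check"
    )
```

At three links the degradation is barely visible; at thirty links roughly one request in thirty-four encounters a failure somewhere along the chain, and there are 435 pairwise interactions that a single upgrade might disturb.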

From a supply chain risk management perspective, these three forces mean that risks to the resilient delivery of operations of all kinds, not just telecommunications services, have slowly and inexorably increased with the evolution of cloud computing. And arguably, these chains are at their most vulnerable whenever the software at any point along them is updated. Because no test system mirrors the full scope of operations for these complex services, there can be no reassurance that nothing will go wrong. Service outages from this source will therefore inevitably increase, and they will impose their full costs in real time in the real world rather than in the harmless test environments where they once manifested themselves.

While the University of Adelaide’s Mark Stewart has suggested that “there is a long standing worldwide trend for companies to inadequately resource the testing and disaster recovery associated with network planning associated with network upgrades,” it is already too late for rigorous testing in a network-based cloud world. There is still time, however, to reconsider our approaches to disaster recovery for when network upgrades go wrong. That responsibility lies with every operator along the chain. But who, in this context, will look out for the chain as a whole?