Resilience: The New Challenge for Digital Systems Policy

AEIdeas

February 10, 2025

Yesterday around 11:00 am, while I was working from home preparing course material for the next university semester, every Microsoft cloud connection to my university was severed. No OneDrive access to my files, no Teams to contact the information technology helpdesk, no Outlook to email my manager about my predicament. I was truly isolated. Fortunately, the website had not gone down, so I was able to locate the Plain Old Telephone System (POTS) number for the helpdesk, only to find that I was 27th in the queue and could I please press 1 to be called back when a technician was available.

Once upon a time, I was quite resilient to these sorts of events: I had all my files located on my laptop, and synchronizing the cloud with the laptop was simply a backup exercise. When the university systems went down, I could keep on working because everything I needed was on hand. I teach (amongst other things) resilience, so I try to practice what I preach. But my new university-issued computer, set up to the university standard, had no files downloaded. I even had to request specifically that OneDrive be synchronized with my machine. And yesterday I discovered that the synchronization was occurring only for files I had accessed. So my 25 years of teaching notes backed up on OneDrive, precisely the material I needed for the task I was embarking upon, were beyond reach for the seven hours it took for the university to notify me that the problem had been sorted. The notification, of course, came by Outlook, and I learned that Outlook was available again only by checking it for the umpteenth time.

My day of lost productivity was multiplied at least 2500 times across the entire university’s staff, with flow-on effects for students, both in class, where teachers could not access materials, and outside it, where students could not reach files stored using their university accounts. Small consolation, then, that had I been teaching, my classes could have proceeded relatively uninterrupted thanks to my old-fashioned (but resilience-driven) habit of taking all my materials to class on a USB drive.

The cause of the catastrophe was a botched software upgrade. The same was true of the catastrophic CrowdStrike outage in July last year, when a faulty update to security software crashed Microsoft Windows systems and plunged entire industries worldwide into chaos. The airline industry alone saw over 5000 flights cancelled and airports clogged with irate travelers. The 2023 Optus outage in Australia and Canada’s 2022 Rogers debacle were further cases in point.

The problem revealed by these outages is that, in digitizing much of society’s commercial and social activity and consigning it to increasingly centralized cloud-based technologies, we have lowered the cost of conducting many activities at the price of greater vulnerability to a host of new, unexpected catastrophes. Much effort in software resilience has focused on detecting, deterring, and recovering from cybersecurity attacks, and national-level inquiries were launched into the CrowdStrike, Optus, and Rogers events. Yet arguably it is the low-level outages, such as the one experienced at my university, that have the largest cumulative effect in lost productivity. They likely occur on a daily basis across the globe as various software is upgraded, which is now a very frequent event. These outages fly below the radar because they are seen largely as a matter to be addressed by the firms concerned. The software firms usually get the software working again rapidly, but this does not compensate the client firms and their customers for the productivity lost during the outage.

To be fair, the firms concerned have saved large costs by outsourcing their information technology operations to cloud services, so they may see the occasional lost productivity as just a cost of doing business. But this does not address the flow-on social costs arising from outages, such as interruptions to students’ learning, travel chaos, or business that could not be conducted.

It behooves all firms to consider how they can make themselves more resilient to outages from all causes, including those arising from a changing climate and uncertain geopolitics, which can similarly cut access to the cloud. It equally behooves policy-makers to consider their role in addressing the social costs. Firms will invest sufficient resources in resilience to meet their own direct needs, but not to cover the flow-on costs that accrue to their clients and to society more generally. There is a valid role for government policy to address these social consequences. Government policies have aided, abetted, and encouraged (even subsidized) the digitization agenda, but the risks of reduced resilience arising from these policies have so far been largely overlooked and passed back to firms and individuals to manage.

With AI and its attendant uncertainties now entering the mix, the time is right for a policy discussion on digital systems resilience, one that brings together software and hardware providers and the firms using their services to consider how to build and fund more resilient systems.

Learn more: Satellite Broadband Connectivity: Many Pros, but Also a Few Cons | Why Meta’s Change in Fact-Checking Is Good for Democracy | The Price of Keeping Promises to Keep Australian Children Safe Online | Political and Regulatory Boundaries in Telecommunications