Abstract
Telecommunications networks have become one of modern society's critical infrastructures (CIs): things required for everyday life and without which widespread disruption can be expected. Historically, the responsibility for ensuring the resilience of their own infrastructures has lain with the individual network operators. However, the complex ways in which economic and social systems now depend crucially on the efficient functioning of an internet system comprised of multiple different operators across the three internet layers create additional value from network resilience that will not be adequately captured in the incentives facing any single operator alone. In these circumstances, society benefits from some collective co-ordination to address the externalities.
As befits a complex nexus of interacting systems, this paper provides a multidisciplinary, exploratory examination of the concept of internet ecosystem resilience and its relationship to (and foundations in) telecommunications resilience, given the challenges posed by increasing systemic complexity. It finds that existing arrangements addressing both physical infrastructure and cybersecurity resilience leave important gaps in internet ecosystem resilience, particularly in addressing the wider social and economic consequences of ecosystem interruption. More research into these consequences is indicated. Attention should also be given to the gaps in funding network infrastructure resilience that arise when continuity of service, and the benefits accruing from application layer resilience, are left to network infrastructure providers alone.
1. Introduction
Telecommunications networks have become one of modern society’s critical infrastructures (CIs): things required for everyday life and without which widespread disruption can be expected. They provide the channels over which information travels to facilitate operation of the internet, which is typified as a three-layer structure. The infrastructure layer consists of the physical links over which the internet functions, such as cables, satellites, towers and internet exchange points. These are overlaid by a logical layer identifying connected devices and the protocols by which they communicate, and a content or application layer (Australian Government, 2024). Whilst conceptually simple, this three-layer structure conceals the underlying reality that the internet is a complex web of interconnected networks, both physical and virtual, within, across and between the layers. For everyday life to function efficiently and effectively, this network web needs to be resilient: that is, be able to “resist, absorb and adapt to disruptions and return to normal functionality” (Blake et al., 2019, p. 2). This is in addition to resilience of each of the underlying components. Moreover, the human systems which internet applications serve – which together with the internet infrastructure comprise the internet ecosystem – must also be resilient to disruptions.
In its initial conceptualization the internet was considered a more resilient physical infrastructure system than the centrally controlled, dedicated point-to-point telecommunications infrastructures and bilateral contractual alliances, primarily supporting voice telephony, that preceded it (Keary, 2024). Decentralized and dynamic control of traffic routing across mesh networks using internet protocols such as the Border Gateway Protocol (BGP) allows data to still be transmitted even if one or more paths (links in the network) fail. These same arrangements allow for more efficient use of network resources by managing traffic flows to avoid congestion (RUSI, 2007). Yet the responsibility for ensuring the resilience of individual components of the internet infrastructure has lain with the individual operators. Each faces its own commercial incentives to invest in sufficient resources to manage its own reputation and revenue flows by minimizing the amount of down-time for its own customers. Resilience of the internet infrastructure in aggregate relies on the best efforts of each of these individual actors.
The complex ways in which economic and social systems now depend crucially on the efficient functioning of an internet system comprised of multiple different operators and functioning across all three internet layers creates an additional demand on internet resilience that goes beyond the functioning of physical data transportation alone. Actions at the application layer create externalities across other layers of the network (and vice versa) that cannot be adequately captured in the incentives facing any single operator alone. This is a classic tension between private incentives and societal welfare – essentially a coordination and externalities problem in critical infrastructure resilience. The resilience of the internet ecosystem is a public good, but investments in resilience are made primarily by private actors who have their own strategic priorities and may not fully internalize the social cost of network failures (Alderson & Doyle, 2010; Rinaldi et al., 2001).
Moreover, technological changes have altered the locus of transactional relationships in the internet ecosystem. In the embryonic internet ecosystem an end-user's internet interaction was mediated, both physically and financially, by a telecommunications (internet) service provider (ISP) in the same physical and jurisdictional space (e.g. as a provider of email and the "triple play" bundle of content (usually cable), internet (web browsing) and telephony). In the current ecosystem, while ISPs remain functionally essential for data transmission, they have become marginalized as financial intermediaries in the internet experience: applications providers increasingly assume direct financial relationships with end users, while expecting that the range of service quality levels required for their applications to perform optimally will be addressed, physically and financially, in the relationship between the ISP and end users – a relationship in which the applications providers play no part (Gautier & Somogyi, 2020; ITU, 2020).1 For example, interactive communications previously managed as a telephone service are now managed by apps such as WhatsApp, Zoom and Teams, while local cable television content transmitted by the ISP has been in large part substituted by video streaming from YouTube, Netflix and the like.
Together, these issues raise significant questions regarding responsibility for both ensuring the resilience of the internet ecosystem and funding its provision. Specifically, is it still sufficient to rely primarily on the resources of infrastructure providers to guarantee the resilience of the internet ecosystem for the public good? And if not, how should the nexus between physical infrastructure resilience and internet ecosystem resilience be managed and funded?
This paper provides an exploratory examination of the concept of internet ecosystem resilience and its relationship to (foundations in) telecommunications resilience given the challenges posed by increasing systemic complexity. It begins with a review of key elements of the resilience literature with specific reference to telecommunications infrastructure and the challenges posed as complexity increases and traditional boundaries between system layers, both physical and financial, have become blurred. We note that befitting the complexities of the ecosystem, this discussion draws from a wide range of disciplinary literatures: complexity science, engineering, management, economics, public policy and law. It then illustrates the complexities via discussions of challenges posed to physical infrastructure layers (e.g. by way of climate change and terrorist activities) and the application and software layers (e.g. software malfunctions and malicious attacks, causing widespread service outages). Next, it returns to the questions posed above and evaluates them in light of the discussion in the foregoing two sections. It concludes with some recommendations for both policy and practice.
2. Literature review
The multidisciplinary literature review illustrates the complexity of the internet ecosystem and the challenges this poses for resilience.
2.1. Resilience: definitions
Like many other concepts, resilience has multiple definitions and applications depending on the context (Pells, 2023). Notwithstanding, internet system resilience is grounded in the realm of complex networks. Qi and Mei (2024) explicate the nexus between complex network type and demands on resilience across four network classifications: Biological; Social; Information; and Technological. While telecommunications infrastructures operate within the realm of technological networks, internet applications operate within social and information network classifications. Fig. 1, reproduced from Qi and Mei (2024: 20), summarizes definitions, properties and resilience emphases of the different network types.

Consistent with Qi & Mei’s typology, resilience in telecommunications and other infrastructure networks focuses on recovery, stability, resistance, bouncing back, rebounding and returning. As information networks, internet infrastructures focus on withstanding, resisting and persisting. OECD (2019: 36) defines infrastructure resilience as “the capacity of critical infrastructure to absorb a disturbance, recover from disruptions and adapt to changing conditions, while still retaining essentially the same function as prior to the disruptive shock.” In this conception, infrastructure resilience combines both technological and information network characteristics.
Firesmith (2019) draws a distinction between infrastructure network resilience and organizational resilience. He observes that telecommunications and internet infrastructure resilience has relied strongly on an engineering-centric conception, where the focus on redundancy and diversity, modularity and connectivity, robustness and stability, autonomous organization, and scalability casts resilience as a matter of design rather than a process. In Firesmith’s view, a system is resilient if it continues to carry out its mission in the face of adversity. It must resist adversity and provide continuity of service, possibly under a degraded mode of operation, despite disturbances due to adverse events and conditions. It must also recover rapidly from any harm that those disruptions might have caused. No system will be 100 % resilient to all adverse events or conditions. However a resilient system must incorporate some controls supporting detection and others supporting response and/or recovery.
Resilient infrastructures, however, must be designed, engineered and managed within the context of an organization – a social and/or commercial system construct. Brown et al. (2017: 8) take an institutional (organizational) perspective when defining resilience as "the ability of an organization to plan for and adapt to crises in order to survive and thrive in an uncertain world." This definition couches the responsibility for resilience in the domain of Qi & Mei's social network: for the infrastructure and information networks to be resilient, the groups or social organizations overseeing them must anticipate vulnerabilities, maintain stability, adjust, and themselves endure. However, the responsibility for maintaining the stability of the "social networks" (including firms and their clients as communities of common interest) utilizing telecommunications and internet infrastructure lies primarily with the operators of these networks/applications. Their stability is influenced by the resilience of the other networks, but it is not the responsibility of the telecommunications and internet infrastructure providers to ensure the resilience of the systems and applications utilizing their services.
Thus, while the relevant networks can be distinctly described, and classified as per Qi & Mei, the complex interdependencies between them mean that the responsibilities for ensuring their resilience cannot be so neatly divided. Furthermore, consideration must be given to both the ability to withstand exposure to uncertain hazards with minimal loss of functionality (operational resilience) and the return to a basic level of functionality as soon as possible: that is, limiting the extent of damage; and limiting the duration of interruption caused by the damage, and its flow-on effects on other system users.
2.2. Public and private benefits in network resilience
There exists a tension between private incentives and societal welfare – essentially a coordination and externalities problem – in critical infrastructure resilience. In economic and regulatory terms, this can be viewed as a form of market failure: the resilience of the telecommunications and internet ecosystems as a whole is a public good, but investments in resilience are made primarily by private actors who have their own strategic priorities and may not fully internalize the social cost of critical infrastructure failures. This phenomenon is well known in cybersecurity and information security, and has strong public good characteristics (Bauer & Van Eeten, 2009).
Certainly, each telecommunications firm, ISP or applications provider (e.g. cloud service provider, video streaming service) invests in resilience to enhance its own service continuity and reputation, and to avoid direct financial losses from downtime. However, from an infrastructure operator's narrow perspective, if a competitor's network fails, this might reduce competitive pressure or even drive customers to alternative services if interconnection arrangements permit. Therefore, companies may not have strong incentives to coordinate resilience strategies to ensure that not all fail at once. This asymmetry of incentives could contribute to a resilience shortfall at the system level due to the emphasis on efficiency and profit rather than on redundancy and resilience. In the case of interdependent critical infrastructures such as broadband and power, Buldyrev et al. (2010) have shown how vulnerable a coupled system becomes even when both infrastructures are quite resilient individually – as with the blackout affecting much of Italy on 28 September 2003. A more recent example is the electricity distribution outage across much of Spain and Portugal on 28 April 2025 (Pinedo et al., 2025).
System-level resilience – ensuring that at least part of the telecommunications infrastructure remains operational during a crisis – has characteristics of a non-excludable and non-rival public good. Society benefits from the availability of communications (e.g. for emergency coordination, information dissemination, and economic stability), regardless of the particular company providing it. Consider a country or region with two providers of (fixed or mobile) broadband service. Should operator A suffer complete failure, operator B may be able to carry both its own high-value services (and the production reliant on those services) and the high-value services previously dependent on operator A. It could do this either by using excess capacity or by throttling its lower-value services. This would be analogous to what happened during the 2020 pandemic (Stocker et al., 2023), when some low-value traffic was reduced on all networks in order to prioritize video conferencing for school and work as the physical contact channel was all but shut down.
The resilience of the telecommunications ecosystem involves intricate interdependencies. Networks rely on shared infrastructure (e.g., fiber routes, data centers, common utilities), and failure in one segment can cascade to others. Yet, if the social cost of a systemic collapse (both networks failing) is not directly borne by each individual firm, there might be little direct financial incentive to invest in measures that ensure collective resilience – such as shared backup infrastructure, diversity in routing, or cooperative contingency planning (Cedergren et al., 2019). Consider again a region with two operators with individual probabilities of total outage (for a critical and fixed amount of time) of $P_A$ and $P_B$ respectively. The probability of a joint outage depends on the correlation $\rho_{AB}$ of the outages and is given by
$$P_{AB} = P_A P_B + \rho_{AB}\sqrt{P_A(1-P_A)\,P_B(1-P_B)}.$$
That is, the joint outage is strictly more likely than $P_A P_B$ whenever $\rho_{AB} > 0$. The product $P_A P_B$ is the likelihood of a joint outage only in the hypothetical case that outages were completely independent. There are, after all, always possible events that take down both networks. The joint probability is linear in $\rho_{AB}$. The importance to society of the resilience of the individual networks depends on $\rho_{AB}$. However, $\rho_{AB}$ has little or no impact on the individual firms' thinking about resilience for their individual networks.
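A worked illustration, using purely hypothetical figures rather than estimates for any actual network, shows how quickly even a modest positive correlation comes to dominate the independent term:
$$P_A = P_B = 0.001,\quad \rho_{AB} = 0.3:\qquad P_{AB} = 10^{-6} + 0.3\sqrt{(0.001)(0.999)(0.001)(0.999)} \approx 3.0\times 10^{-4},$$
roughly 300 times the one-in-a-million joint outage probability that would obtain if the two networks' outages were independent.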
In some markets, strong competition may push all players to ensure their networks are highly resilient, as downtime translates directly into brand damage and lost subscribers. Even in the case of a joint outage, some brand value could be gained from being the first network to be up again. However, the individual operators cannot necessarily be expected to be much more concerned about a joint outage than about an outage of their own network. In fact, they might prefer a joint outage to an own-network outage as the joint outage would not negatively affect their brand alone. High resilience within each network might not therefore automatically guarantee acceptable system-level resilience because system-level outages will occur. For this reason, diversity and coordination (ensuring that critical functions are not all reliant on the same weak link) become crucial, especially if all networks share the same single points of failure (e.g., a critical submarine cable route or a single data center cluster).
2.3. Network resilience: a co-ordination matter?
The need for co-ordination to manage resilience of multiple network systems invokes another tension: that between centralization and decentralization of control.
By engineering design, Internet Protocol (IP) networks rely heavily on centralized functions such as the Domain Name System (DNS) or the Border Gateway Protocol (BGP), the failure of which can result in cascading effects through the internet ecosystem (Owen et al., 2024). Telecommunication networks are also highly reliant on time signals from the Global Positioning System (GPS), often using single-frequency receivers (Lombardi, 2021). Co-ordination using physical standards plays a vital role in facilitating the interaction of systems owned and operated by diverse interests. Nevertheless, design decisions and operational interaction at the points where networks interconnect with each other must necessarily rely on physical as well as commercial co-ordination between the relevant operators.
Historically, governments have been involved in specifying both the capabilities and locations of networks, and this has extended to the setting of technical standards and service level obligations that influence network resilience. Governments and regulators may set standards or impose requirements to ensure diversity and redundancy at the macro level through:
- Regulatory mandates: Requirements for interconnection diversity, backup power standards, or minimum levels of robust routing can raise the baseline resilience for all.
- Incentive structures: Subsidies, tax breaks, or pooled insurance mechanisms can offset the cost of building resilient inter-network links or shared emergency infrastructure.
- Collaborative frameworks: Industry consortia and public-private partnerships can foster a shared understanding that systemic resilience is beneficial. Voluntary guidelines, best practices frameworks, and sector-specific mutual aid agreements can emerge when all parties recognize the larger stakes.
Governments are also likely the best parties to plan and co-ordinate immediate responses and recovery when external events – e.g. abnormal weather, earthquakes – affect multiple network operators simultaneously, even though it may remain the network operators' responsibility to anticipate such disruptions and restore their own infrastructures when fractures occur. This is due to governments' superior capabilities to compel specific actions and their access to resources beyond those available to individual firms (e.g. taxpayer-guaranteed financing for recovery). Yet, the greater the extent of government specification and management of disaster recovery (or any other form of centralized control and management), the less the capacity of the individual firms to respond flexibly and/or innovatively to the crises they directly face. This limits the ability for new or novel responses to occur – something often necessary given the inability of anyone to know in advance precisely what challenges will be posed by unique and inherently unpredictable circumstances. Recovery is not defined as resuming exactly the prior state before the shock, but may involve changing, adapting to new conditions and improving the system's functionality over time (OECD, 2019). Thus, resilience must also embody the need for adaptability and learning, fostering the evolutionary and emergent behavior characteristic of complex systems.
2.4. Governance frameworks for resilience
The OECD has taken a self-declared systems-thinking approach in developing the seven inter-related policy challenges of its Policy Toolkit on Governance of Infrastructure Challenges (OECD, 2019, p. 106). It recommends:
- Setting up a multi-sector governance structure for critical infrastructure resilience.
- Understanding complex interdependencies and vulnerabilities across infrastructure systems to prioritize resilience efforts.
- Establishing trust between government and operators by securing risk-related information-sharing.
- Building partnerships to agree on a common vision and achievable resilience objectives.
- Defining the policy mix to prioritize cost-effective resilience measures across the life-cycle.
- Ensuring accountability and monitoring implementation of critical infrastructure resilience policies.
- Addressing the transboundary dimension of infrastructure systems.
It builds on earlier work identifying that resilience is achieved via the combination of key qualities (OECD, 2011):
- Robustness – remaining functional requires systems to be strong enough to withstand (be maintained to a standard that allows them to withstand) low-probability but high-consequence events;
- Redundancy – the ability to keep operating through substitute or redundant systems should critical components cease to function;
- Resourcefulness – the ability to skillfully manage a shock event as it unfolds: identifying options, prioritizing control of damage and beginning mitigation, and communicating decisions to those who will implement them. Resourcefulness depends primarily on people, not technology.
- Adaptability – absorbing new lessons that can be drawn from a catastrophe.
OECD (2019) provides a system-level template for a good high-level governance process, taking into account the need for an all-hazards and all-threats, risk-based, life-cycle scope with co-ordination across multiple sectors. Importantly, it takes account of the international (transboundary) challenges inherent in an international internet-based economy. Its risk- and threat-based approach has rendered it useful for addressing specific issues, such as cybersecurity or climate threats. This suggests it might offer a promising framework for governing resilience policy and operation in the complex internet ecosystem.
However, it is constrained by its infrastructure focus, which confines its relevance primarily to Qi & Mei's technological networks. It is not clear how it addresses the interrelationships between the physical, commercial, informational and social elements that comprise the complex internet ecosystem. Nor does it explicitly consider the gap between the efforts of individual actors to address their own proximate infrastructure and operational resilience and the externalities that the consequences of interruption to their services impose on system actors in other parts of the ecosystem. It also presumes a degree of government and regulatory direction and control in points 5 to 7, apparently minimizing the extent to which non-government actors govern investment and information flows, and the extent to which government actions in respect of other activities can both enable and constrain the pursuit of resilience.
As the internet ecosystem has increased in complexity, with a greater number of individual networks across all three layers of the ecosystem, the potential points and costs of failure have multiplied exponentially. In this evolution, the demands for resilience have also multiplied, along with the potential gaps where the resilience of the complex internet ecosystem and its individual components will be challenged: in effect, a magnified "missing market" for internet ecosystem resilience policies and operational plans.
3. Challenges to internet ecosystem resilience
While it is impossible in an uncertain environment to predict when and where specific challenges will arise, or precisely what their form will be, past experiences have enabled classification of two generic sources of threat to internet ecosystem resilience: threats to the physical (technological) elements of the ecosystem; and threats to the informational (data and software) elements. A subset of cybersecurity threats involves the use of software applications to cause deliberate disruption to the flow of traffic across specific physical nodes of the internet system. Experience of cybersecurity attacks has led to the development of coordinated strategies to dissuade, detect and recover from such events; however, these too stop short of considering the full public costs of such challenges to internet ecosystem resilience.
3.1. Physical
Threats to physical infrastructure operation are grounded in the experiences of telecommunications infrastructure operators, whose network resilience has been challenged by natural phenomena such as weather, forest fires and earthquakes, and by intentional human-induced vandalism, resulting in the severing of physical links in the networks (Blake et al., 2019). Examples include recent forest fires in Australia, Canada and the United States, earthquakes in New Zealand and Japan, ice movements in Arctic regions severing cables, and deliberate damage of undersea cables and other infrastructure for geopolitical motives (Brown et al., 2017).
Arguably, resilience to physical interruptions is both easier to anticipate and plan for, and to manage, than resilience to software interruptions. The responsibility for managing a crisis event is necessarily devolved to the locality where it occurs, even though its consequences may play out more widely. Furthermore, planning can anticipate some challenges and mitigate them. For example, in Canada (Public Safety Canada, 2023) and Australia,2 central planning for unexpected power outages at cellphone towers extends to battery backup being installed (with government funding in some cases). Moreover, in Canada, company-owned mobile generators can be deployed at towers to provide services when batteries have been depleted or damaged during a crisis, in order to maintain essential communications. Likewise, use of satellite connections by first-responders ensures essential communications occurs even when other network links may be vulnerable (Bell Canada, 2024).
To some extent, the engineering design of the internet as a complex mesh network with a focus on redundancy and diversity, modularity and connectivity, robustness and stability, autonomous organization and scalability confers significant resilience (Bischof et al., 2023; Firesmith, 2019). For example, when the Quintillion undersea fiber cable was cut in the (northern) summer of 2023, internet access to parts of rural Alaska was constrained, leading to lower data speeds in some locations as traffic was diverted across other links, but applications (and contingent commercial and social interactions) remained functioning – albeit at lower quality (Grindal & Meinrath, 2024).
However, there will always be some parts of the network where, by design or due to cost constraints, there is insufficient redundancy to maintain connectivity in the event of connection rupture. Localized disasters such as earthquakes and floods can expose these vulnerabilities. For example, a flood in February 2023 on the East Coast of New Zealand’s North Island swept away a bridge carrying all fiber backhaul links serving a population of over 200,000, rendering all internet access over both fixed and mobile connections and most commercial interactions (including all point-of-sale and automated teller transactions in an effectively cashless community) impossible for several days (Howell, 2024; Li et al., 2024). This particular vulnerability was exacerbated (or arguably even facilitated) by a number of Government policies encouraging use of shared infrastructure (mandated sharing of rights-of-way; a rural broadband funding arrangement3 subsidizing otherwise-competing fixed and mobile network operators to share spectrum, towers, backhaul and other resources). By prioritizing low-cost and wide-ranging connectivity, network resilience was relegated to a secondary position – if indeed it was considered at all when government funding was provided and tenders submitted. Ironically, availability of back-up generation at cellphone towers in the event of electricity outages had been anticipated and was provided by the network operators, but proved impotent in this disaster, reiterating the fact that managing for known risks is not the same as managing for uncertain outcomes (Logan et al., 2024).
The New Zealand flood example highlights the inherent resilience tensions between the pursuit of low-cost connectivity and investment in redundancy, and how centralized control conflicts with local responsiveness. It also illustrates the inability of individual network operator firms, which provide the infrastructure but not the applications, to address the resilience needs of those whose social and commercial connections (via physically distant applications, in some cases operated by parties quite unaware of the plight of their local customers) were severed for days. These needs are manifested in the "social costs" of the outage – which cannot be addressed by relying on private incentives alone to address network resilience. Local innovation occurred, but it came despite, not because of, centralized co-ordination of either planning or disaster management. It came from a rival satellite network operator donating equipment and connections (dropped in by helicopter due to bridge outages), enabling essential retailers to resume limited digital operations necessary to facilitate physical local recovery (e.g. selling essential cleanup supplies and fuel to run recovery equipment) (Speidel, 2023). It was insufficient to address the total social loss, but, like the Quintillion cable outage, it allowed a basic level of interaction to take place until full service could be restored, and it provided valuable learning to inform future modifications of both individual firm and coordinated multistakeholder resilience planning.
3.2. Software
Threats to the internet ecosystem from software origins (essentially, events in the information and application layers of the system) are newer phenomena than physical threats. However, as more activities of the physical infrastructure are managed by digital software-controlled devices, and as greater amounts of data and information processing move from the physically-proximate locations where activities take place to ever-more centralized locations (cloud-based storage and processing; software-as-a-service providers), the greater are the calls for attention to be paid to "software and data resilience".
Some of the highest-cost interruptions to the internet ecosystem in recent years have been due to software failures. The July 2024 CrowdStrike software upgrade disaster illustrates this. A faulty configuration in an upgrade to a widely-used CrowdStrike application, intended to enhance security by targeting newly-observed malicious activities, caused a sequence of crashing, rebooting and bootlooping, or rebooting into safe mode, across a wide range of systems running Microsoft operating software – predominantly those used by organizations, and notably, Windows virtual machines on the Microsoft cloud platform used in large data centers (Umbelino, 2024). The effects cascaded across the world's air and ground transport, finance, healthcare, media, retail and other industries (Alvarez, 2024). While the error affected approximately 8.5 million devices (less than 1 % of Microsoft's global installed base – Kerner, 2024), the effects were amplified by the extent to which both data storage and processing – once decentralized processes – are becoming increasingly centralized and highly reliant upon the efficient transportation of vast amounts of data to and from decentralized locations. A small number of failures can have an extremely large effect, and in these instances the responsibility for internet system resilience is beyond the scope of telecommunications networks and their operators alone.
Cloud operators have obligations for maintaining the integrity of their own systems, but they do not necessarily manage the relationships with the end users of applications that use their services and that are made dysfunctional by software-initiated outages. These relationships are mediated through complex contractual and software-managed applications and tools, including international payment mechanisms, that enable transacting between parties in completely different parts of the world, who may be unaware of who is managing their transactions and exactly how these interactions occur, or even where in the complex internet web of infrastructure they take place. Each relies on a complex nexus of interactions that, for the most part, works, but which, due to the colossal scale at which these transactions are aggregated, causes catastrophic consequences when it does not. Because it cannot be known in advance which links in the complex nexus will fail and who will be harmed, it is impossible to know what specific actions to take to make the system resilient to specific threats. When an outage occurs, the only likely course of action is to rely on the 'best efforts' of the relevant parties to get their parts of the system up and running again as soon as possible – usually by reverting to an earlier version of the software concerned. The costly consequences to end users are typically not a consideration of individual operators – these users too are left to rely on their own 'best efforts': either insurance payments to compensate for loss of earnings (if the entities concerned can get coverage for these sorts of events) or self-insurance (absorbing the losses if they cannot get, or cannot afford, insurance). At best, an extensive post-hoc inquiry can be conducted to understand why the outage occurred; but once the cause is identified, the probability of the same outage occurring again returns to the same, or an even lower, small number than prevailed before the original unanticipated outage.4
Furthermore, software failures can occur because of apparently prudent efforts taken to make individual elements of the system more “resilient” in the first place (e.g. placing limits to preclude operating outside predetermined parameters which are considered sufficiently unusual or dangerous to warrant caution). But because these actions are taken mostly without knowledge of similar limits and restrictions imposed by other parts of the system, their subsequent interaction can lead to unexpected, unintended and costly outages.
The Optus network outage in Australia on November 8, 2023, which followed a routine software upgrade on parent company Singtel's North American network, illustrates this (Howe, 2023). Over 10 million people and 400,000 businesses and government entities across Australia were affected (Jose & Kaye, 2023). The Senate Standing Committee on Environment and Communications instigated an inquiry, at which it was revealed the outage arose from a gradual event triggered by loss of connectivity between neighboring computer networks. Approximately 90 provider edge routers disconnected following a spike in messages from the CloudFlare security system indicating a Border Gateway Protocol routing problem. The spike was traced to the disconnection of one router following the Singtel upgrade. This triggered Optus' routers to rapidly update Optus' routing tables, which in turn triggered the shutdown because the default threshold set by Cisco Systems for the Optus network was exceeded (Optus, 2023). Optus and its affiliated providers managed to restore services fully after around 12 h (customers were gradually able to access services from 6 h after the initial problem was identified). However, the commercial ramifications were significant: the company's share price fell by around 5 %, and rival firms Telstra and Vodafone (TPG) saw a significant upswing in customer numbers (Williams, 2023). Compensation amounting to around $1.2 million in cash and service credits was provided to Optus customers (Crozier, 2024), but those relying on services provided by these Optus customers bore mostly uncompensated costs (for example, a bus passenger relying on real-time information provided by a bus company customer of Optus to plan a journey to an urgent medical appointment). On 20 November, the CEO resigned.
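The underlying mechanism can be sketched in stylized form. The following toy simulation is purely illustrative: the router counts, route volumes and safety limit are hypothetical assumptions and do not describe Optus' actual configuration. It shows how a per-router protective threshold, sensible for each device in isolation, can convert a localized routing surge into a network-wide shutdown once every surviving router inherits the load shed by its failed peers:

```python
# Stylized sketch of a protective-threshold cascade. All figures are
# hypothetical; this is not a model of Optus' actual network.

def surviving_routers(n_routers: int, base_routes: int,
                      safety_limit: int, surge: int) -> int:
    """Routers shut down defensively when their routing table exceeds
    safety_limit; each shutdown adds that router's own routes to the load
    the remaining routers must absorb."""
    up = n_routers
    extra = surge  # routes suddenly needing redistribution across peers
    while up > 0 and base_routes + extra / up > safety_limit:
        extra += base_routes  # the tripped router's routes join the surge
        up -= 1               # one more router shuts itself down
    return up

if __name__ == "__main__":
    # A hypothetical surge of 20 million route updates spread across 90 routers:
    print(surviving_routers(90, 500_000, 600_000, 20_000_000))  # -> 0 (full cascade)
    print(surviving_routers(90, 500_000, 800_000, 20_000_000))  # -> 90 (surge absorbed)
```

In the sketch, the same surge that one set of thresholds absorbs entirely takes down every router under a slightly tighter limit – the all-or-nothing character of such protective mechanisms interacting across a network.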
Increasing reliance on software necessarily leads to increased vulnerabilities when software managed at multiple locations by different entities, interacting with a wide variety of other applications, is upgraded. Both unexpected complexities and human errors in response to them can arise; the probability of such events occurring can only increase as the number of potential failure points increases. The Rogers Communications outage during a software upgrade on July 8, 2022 affected millions of customers and disrupted payment systems, internet access and emergency services across multiple provinces. The subsequent inquiries by the CRTC (Canadian Radio-television and Telecommunications Commission, 2024) and parliamentary committees – along with industry-led reviews – highlighted the significance of cross-network interdependencies, backup arrangements, and timely incident reporting. Recommendations included improved transparency, redundancy, and mutual aid agreements between providers (Gannon, 2024). But these will be effective only to the extent that coordinated interaction and use of the now-transparent information can be achieved.
However, the same risks apply within individual enterprises when, inevitably, enterprise software must be upgraded. A particular instance of a failed software upgrade in February 2025 at the author’s university led to a less-widespread, but nonetheless costly, outage due to prior decisions to transfer all data and most application hosting out of the university to a cloud provider and institute a policy (for local convenience) to prevent files being synchronized to locally-accessible devices without prior permission. Successive device upgrading under the policy meant that when the outage occurred, most academics were left with no access to any of their current or legacy files (e.g. Word, Powerpoint, Excel documents) and the (Microsoft) applications typically used to communicate with others (e.g. Outlook, Teams) for the 7 h it took to restore the system – effectively crippling both teaching and research activity. Centralization to cloud systems is undertaken to reduce local technical work (cost savings), but this comes at the expense of local institutional resilience. A similar loss of resilience occurs when software is provided as a service, and services are lost during botched upgrades.
While the individual consequences of such outages are small in comparison to the Rogers and Optus outages, the fact that they are likely replicated daily somewhere in the world (with increasing frequency as digitization of activities proceeds) constitutes an aggregate loss that is not at all trivial. Yet as the costs of data storage have fallen relatively faster than the costs of data transport, and the sophistication of synchronization software has improved markedly over the same period, it would appear prudent to consider the resilience benefits of enabling at least some file and software duplication locally, to enable operation when the center is unavailable (Howell, 2025a, 2025b).
3.3. Cybersecurity and resilience co-ordination
To some extent, software resilience has been considered nationally and internationally under the aegis of "Cybersecurity". The security lens emphasizes threats and deliberate attacks, and focuses on rapid detection, isolation of effects and efficient recovery of affected infrastructure (World Economic Forum, 2024), albeit recognizing that the societal impacts of cyber threats necessitate protection of those infrastructures as a matter of safeguarding public wellbeing (Chen et al., 2020; Idengren, 2024). Importantly, the development and management of effective cybersecurity are seen to require strong foundations in collaboration and co-operation between a wide range of stakeholders and to be embedded in the governance and strategic frameworks of all system participants, thereby echoing the OECD's policy toolkit for the governance of infrastructure challenges (OECD, 2019).
The World Economic Forum identifies the ecosystem-wide scope of cybersecurity measures, at the level of individual firms and at higher levels of aggregation (i.e. industries and national governments). It advocates collaboration to ensure resilience efforts extend beyond individual organizations as a means of strengthening the internet's foundational infrastructure. Such collaboration should include:
- Addressing common points of failure through collective risk mitigation;
- Pooling cybersecurity expertise to bolster entities lacking internal capabilities; and
- Investing in talent development to close skills gaps.
In this perspective, "organizations, regulators and policy-makers (at international, national and regional levels) should co-operate to develop regulations that support and incentivize cyber resilience", yet "none of the above negates the fact that organizations are individually responsible for managing the risks to their own primary goals and objectives" (WEF, 2024, p. 13). However, it is acknowledged that these measures represent just the start of an iterative process.
The embryonic nature of national and international approaches to cybersecurity management, and the potential for fragmentation, are illustrated by Australia. In this country, critical infrastructure is considered to be physical facilities, systems, assets, supply chains, information technologies and communication networks which, if destroyed, degraded, compromised or rendered unavailable for an extended period, would significantly impact the social or economic well-being of Australia as a nation or its states or territories, or affect Australia's ability to conduct national defense and ensure national security.5 The Department of Home Affairs operates the Cyber and Infrastructure Security Centre, which formulated the Critical Infrastructure Resilience Strategy and the Critical Infrastructure Resilience Plan6 in 2023. These provide an overarching resilience framework, including co-ordination of a Trusted Information Sharing Network where federal, state and local government and industry can engage collaboratively, apparently giving effect to the sorts of arrangements advocated by the World Economic Forum.
In practice, however, policy implementation for general system resilience appears to be fragmented across different government departments, with a clear split between cybersecurity oversight and oversight of infrastructure resilience more generally. Cybersecurity is overseen by the Australian Signals Directorate7 while infrastructure resilience investment is coordinated from the government's perspective via the Department of Infrastructure, Transport, Regional Development, Communications and the Arts.8 The broad scope of the Department's portfolios suggests a strong position for integrating activities across multiple sectors. However, most of its initiatives appear to be focused on individual infrastructures and framed in response to (and in anticipation of) specific disasters, such as floods and forest fires. In respect of telecommunications and internet networks, it administers a range of government grants for specific purposes. The vast majority of these projects form part of the Strengthening Telecommunications Against Natural Disasters (STAND) program, announced as part of the Government's $650 million bushfire recovery funding package in January 2020. This is in addition to the Regional Connectivity Program, which funds infrastructure in regions where private investment is not commercially forthcoming.9
4. Gaps in internet ecosystem resilience
At the outset of the paper, the following questions were posed:
- Is it still sufficient to rely primarily on the resources of infrastructure providers to guarantee the resilience of the internet ecosystem for the public good? And
- If not, how should the nexus between physical infrastructure resilience and internet ecosystem resilience be funded and managed?
The literature review identified that while conceptual differences might exist between the physical network infrastructure systems underpinning the internet and the internet ecosystem as a whole, in practice these are very deeply intertwined. The engineering design of the internet infrastructure facilitated decentralized control of traffic routing, underpinning the original concept of an internet ecosystem in which data transport could be considered completely separately from the applications using that data (the "dumb pipes" in the core and the "smart applications" at the edge).
However, the complex intertwining of physical data movements with the complex economic and social systems utilizing that data means that they could never be so neatly separated. It is naïve to presume that any one set of stakeholders in the complex internet ecosystem should alone be responsible for guaranteeing system resilience. The historic reliance upon telecommunications providers alone to fund physical infrastructure resilience is shown to be insufficient to protect the public good. The public good relies upon the efficient operation of each individual part and the interconnections between the parts, at all layers of the ecosystem.
4.1. Infrastructure resilience
As a public good matter, it is appropriate that governments (and arguably pan-government entities) should take the lead in the necessary coordination to address internet ecosystem resilience (Chevalier et al., 2024). That this has been recognized in principle is reflected in the OECD’s recommendations, which take a systems-thinking perspective and therefore cover a wide multi-sector approach when considering the issue of infrastructure resilience. Such an approach implicitly recognizes that relying only on individual actors (notably network operator firms) to manage their own resilience leads to insufficient effort for the social good. Consequently, it is appropriate that funding from other sources (e.g. taxpayer funds) be used to supplement the efforts of these individual actors in making their own systems resilient (e.g. grant funding for solar batteries at cellphone towers, as observed in Australia).
However, the weakness of these guidelines is that they address only the inner two layers of the internet ecosystem: they do not take account of the linkages between end users, applications and infrastructure. Hence, they are unable to satisfactorily address the wider social costs arising from either physical interruptions (for example, fires, storms and cable severing) or software-induced interruptions (e.g. CrowdStrike, Optus) to infrastructure services. Even though governments may take a lead role in coordination, their local (national or regional) focus tends to place emphasis on participation solely by stakeholders over which they have legal jurisdiction. Consequently, they are unable (or unlikely) to include international software and applications providers that lack local representation yet impose risks and costs on local resilience.
The result is that one group of stakeholders does not participate in funding the resilience of infrastructure serving customers from whom they derive income and profits. In effect, this group free-rides on the resilience efforts of the physically-proximate parties. They participate in the benefits when all goes according to plan, but they do not share in the costs of ensuring the resilience of these systems. They can play a role in recovery from disruptions, whether these are due to these parties (e.g. CrowdStrike) or not (the satellite operator during the New Zealand flood), but they cannot be obligated to participate. This shifts a disproportionate share of the costs and risks of resilience onto local stakeholders; in most cases the relevant firms have the resources to manage these risks, but the unmet costs almost always fall on individual end-users with fewer means to manage the consequences (such as the New Zealand flood victims, unable to undertake even basic commercial transactions to begin their recovery). Fundamental risk management theory suggests that greater sharing of the local risks with international application providers is indicated if a more efficient and resilient internet ecosystem is to be achieved.
4.2. Cybersecurity and resilience
Likewise, cybersecurity measures, as recommended by the World Economic Forum and evidenced in national systems such as Australia's, also offer a promising step forward. They too emphasize the need for a wide-ranging multi-stakeholder approach and, recognizing the externalities arising, have been led by governments and industry associations with the power to compel participation (via government regulations and/or the contractual obligations of association membership). Advantages of cybersecurity measures over those focusing on infrastructure resilience include: the inclusion of (layer 3) applications providers within their stakeholder scope; and an international focus that recognizes that the internet ecosystem does not and cannot respect national borders (Bauer & Van Eeten, 2009).
However, cybersecurity measures also have weaknesses as means of maintaining internet ecosystem resilience. Their focus on security matters and on deliberate attempts to disrupt the ecosystem (e.g. distributed denial of service attacks) leaves them poorly positioned to anticipate or address truly unintentional ("accidental") outcomes arising from the unpredictable effects of multiple complex systems interacting with each other in ways that cannot be easily foreseen (e.g. CrowdStrike; Optus). Consequently, these events can "slip through the cracks". While the World Economic Forum guidelines provide advice to large firms about guarding against cybersecurity threats, they cannot provide guidelines for addressing threats that have not yet even been identified or understood.
It is not clear whether addressing these novel events where there is no specific identifiable cause should be included in the responsibilities of cybersecurity measures, or whether they should be addressed as a separate matter. An advantage of including them is that for the most part, cybersecurity measures include relevant stakeholders at the applications and network operation levels of the internet ecosystem. However, they do not specifically address the interests of end-users and the wider social costs to them of interruptions. Hence they may not be in a good position to identify end-user concerns and act upon them (for example, the low-level but frequent and costly interruptions occurring during failed local software upgrades).
5. Implications for policy and practice
As increasingly more economic and social interactions have moved from the physical to the digital world, greater emphasis must be placed on measures addressing the resilience of the internet ecosystem. Historic resilience measures used for telecommunications infrastructures, where each operator was responsible for the resilience of its own networks, will not be sufficient. Neither will the measures introduced to address cybersecurity matters. Because the internet ecosystem is a complex web of multiple complex technical, information and social networks, impacting on almost all aspects of human existence, the matter of internet ecosystem resilience now borders upon the matter of human resilience more generally.
While measures taken to address infrastructure and cybersecurity resilience are commendable for their endeavors to bring all stakeholders together and utilize collective wisdom, two important gaps appear to remain: first, the obligations of applications providers to contribute towards the resilience of the infrastructures they utilize, but with which they have no specific commercial interactions, in order to serve their customers' and broader societal needs; and second, the responsibilities of end users to contribute towards managing their own risks when interruptions occur.
The question of application providers' responsibilities towards funding infrastructure resilience is an extension of the questions raised in the network neutrality debate, in which it has been argued that applications providers should not pay network operators for data traffic delivery. While there may be some plausible support for this argument at the level of individual data transmissions, applications providers share in the benefits generated by these networks and, in the interests of the public good, should contribute towards their resilience, in a similar manner to the fair cost recovery being called for by network operators from video streaming firms (e.g. Layton & Potgieter, 2021). While enforcing national laws on firms in foreign jurisdictions is problematic, there is scope for existing forums such as cybersecurity panels to begin a conversation on addressing these responsibilities.
Nonetheless, scope exists for more to be done by end users – both individuals and firms, as consumers of internet-mediated services – to consider how they can make themselves more resilient in the event of interruptions. Just as general disaster preparedness actions (such as having emergency supplies of food, water, cash and other essentials) enable some resilience to physical isolation (e.g. when a bridge or road is severed), it behooves these users to plan for internet interruptions. It also behooves the suppliers of the applications used by these individuals to facilitate their resilience if at all possible. For example, when applications are digitized, consideration should be given to retaining the knowledge and capacity to carry out a rudimentary process non-digitally, if the process is likely to be critical to operating in the constrained environment. This would be the equivalent of a customer holding a store of cash for emergency purchases and, for a retailer, maintaining the capacity to handle cash transactions when necessary. In the event of a disruption, the issue is not one of functioning as efficiently as before, but the simple act of being able to function at all.
A specific responsibility attends here to providers of cloud services and their customers. Cloud operators claim greater efficiencies for customers buying their services, but they seldom address the matter of the resilience of the customers whose data and programs they now host. The reason is twofold: highlighting the risks would disincentivize customers; and addressing the matter reduces the savings "on the table" for the customer and the profits "on the table" for the cloud operator. Yet the failure to address these matters proved catastrophic during the CrowdStrike outage. Airports could not process check-in or baggage handling because both applications and data were hosted on cloud servers. Disruption could have been reduced (although not eliminated) if some rudimentary programs had been available locally, and/or key data, aggregated on some timeframe (e.g. daily or hourly), had been transferred to the locality as a precaution. In a similar manner, the author's university could have avoided much of the cost of its outage by relaxing its policy to allow files to be held on individual computers. The Microsoft Cloud software already allows for seamless digital synchronization of data held in this manner without need for additional technical support except at the initial setup of a device.
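As a concrete illustration of the precaution described above, the following minimal sketch (the folder paths, snapshot interval and data layout are hypothetical assumptions, not a description of any particular organization's systems) periodically copies key operational data from a cloud-synchronized location to local storage, so that rudimentary offline operation remains possible during a cloud or connectivity outage:

```python
# Minimal sketch of a local fallback snapshot; all paths and the schedule are
# hypothetical and would need to be adapted to the organization concerned.
import shutil
import time
from datetime import datetime
from pathlib import Path

CLOUD_EXPORT = Path("/mnt/cloud/daily_export")  # hypothetical cloud-synced folder
LOCAL_CACHE = Path("/srv/local_cache")          # hypothetical local fallback store
SNAPSHOT_INTERVAL_S = 60 * 60                   # hourly, per the aggregation example

def take_snapshot() -> Path:
    """Copy the latest exported data into a time-stamped local folder."""
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = LOCAL_CACHE / f"snapshot_{stamp}"
    shutil.copytree(CLOUD_EXPORT, target)  # local copy remains usable if the cloud is unreachable
    return target

if __name__ == "__main__":
    while True:
        print(f"Local snapshot written to {take_snapshot()}")
        time.sleep(SNAPSHOT_INTERVAL_S)
```

In many cases the same effect can be achieved simply by enabling the local caching already built into cloud synchronization clients; the point is that retaining a local copy is a resilience choice whose costs and benefits should be weighed explicitly.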
Of course, these measures require firms and individuals to confront the question of the costs and benefits of maintaining this additional functionality. If the costs and risks are low, then doing nothing and bearing the losses in the event of an unexpected outage may be optimal. But this requires some understanding of what those costs and risks might be. For the most part, the costs and consequences are not appreciated until an event occurs, and by that time it is too late to avert them. At best, the learning from these events needs to be aggregated and shared to allow better decisions to be made, using the best available data rather than responding to fears or falling prey to complacency. A clear gap exists for more research (for example, quantifying the costs of outages, and developing frameworks for evaluating the costs and benefits of different applications) and education. The responsibility for this rests with both government and industry, but co-ordination of the effort needs to occur at local, national and international levels.
6. Summary and conclusion
The internet ecosystem is a complex interweaving of multiple complex adaptive systems operating across all three conceptual internet layers. Moreover, as a complex adaptive system itself, it is evolving dynamically, absorbing increasingly more elements of human interactivity across an ever-wider array of commercial, social and other spheres. While historic responsibility for individual network resilience may have been appropriate at the level of individual infrastructure providers when they mediated almost all internet activity across only a small number of limited applications, that no longer applies to the contemporary internet ecosystem.
While not diminishing the important contributions to resilience made by individual telecommunications network operators, the complex ways in which economic and social systems now depend crucially on the efficient functioning of a complex internet ecosystem place additional demands on internet resilience that go beyond the functioning of physical data transportation alone. Resilience involves all participants across all spheres. It is insufficient to rely on the incentives facing any single operator alone, in any of the spheres of influence. This is a classic tension between private incentives and societal welfare: the resilience of the internet ecosystem is a public good and will require the coordinated actions of all stakeholders if it is to be adequately addressed.
The wide-ranging scope of interactions, spanning local, national and international domains, poses challenges to effective resilience planning, investment and management. Clearly, governments and pan-government entities are well-positioned, and have already taken some steps, to facilitate the development of a more resilient internet ecosystem. This is most evident in cybersecurity measures (for software-based threats) and local physical disaster deterrence and recovery. However, changes to the nature of commercial transactions on the internet away from local infrastructure providers towards international providers, not least of which is the increasing centralization due to cloud computing, are creating new risks and challenges to be addressed. The costs of building and maintaining resilient infrastructures are increasingly disconnected from the revenues and resources derived from their use. New vulnerabilities are being created as data and software move away from the locations where their applications are manifested. For the new vulnerabilities to be addressed, new discussions and new policies are required.
The challenge for resilience policy in the future will be in addressing the responsibilities for funding and operationalizing the resilience of the ecosystem elements – across all layers – to support the needs of end users. The social and commercial systems on which end users rely are now so closely interwoven into the internet ecosystem that its resilience increasingly amounts to the resilience of society itself. Ultimately, resilience will be achieved through a combination of individual and cooperative actions – by individuals, firms, network operators, industry governance bodies and governments.