Introduction
In the constantly evolving digital landscape, the concept of network resilience has emerged as an essential facet of maintaining reliable and efficient systems. Network resilience refers to the ability of a network to maintain an acceptable level of service in the face of various challenges and disruptions. These could range from cyberattacks to hardware failures, and from human error to natural disasters. The growing reliance on digital systems for everything from business operations to personal communications has made the robustness of these networks crucially important.
However, enhancing network resilience is no small task. Networks are complex systems, often spanning vast geographical areas and comprising numerous interdependent nodes and links. They must contend with an array of potential threats, each of which can cause significant disruptions if not appropriately managed. This reality makes the quest for network resilience an ongoing and evolving challenge.
In response to these challenges, a groundbreaking framework known as the ResiliNets architecture has been developed. This architecture, designed to enhance network resilience, draws on a set of axioms, a resilience and survivability strategy, and a set of supporting principles to bolster network performance, even in the face of adversity.
The ResiliNets architecture is the brainchild of a trio of accomplished scholars: Prof. James P.G. Sterbenz, Prof. David Hutchison, and Dr. Justin P. Rohrer. Their collective expertise spans the fields of Electrical Engineering, Computer Science, and Economics, and their work on network resilience has been widely recognized and adopted. Prof. Sterbenz, in particular, was known for his pioneering work on high performance networking and his focus on network resilience in his later years [1].
This article aims to delve deeper into the ResiliNets approach to network resilience, exploring its fundamental principles, highlighting its relevance in the context of Operational Support Systems (OSS), and shedding light on the significant contributions of its creators. The goal is to provide a comprehensive understanding of how this innovative architecture can serve as a roadmap for building and maintaining resilient networks in an increasingly digital world.
The ResiliNets Axioms
The ResiliNets architecture is grounded in four resilience axioms that set the foundation for its resilience and survivability strategy. These axioms – Inevitability of faults, Understanding normal operations, Expecting adverse events, and Responding to adverse events and conditions – provide a roadmap for building resilient networks. They reflect the realities of network operation and the inherent challenges of maintaining network performance in the face of various threats and disruptions.
The first axiom, the Inevitability of faults (I), acknowledges that faults are an inevitable part of any network. No network is completely immune to faults, whether they arise from hardware failures, software bugs, human error, or external factors such as cyberattacks or natural disasters. Recognizing this reality is the first step towards building a resilient network. Once the inevitability of faults is accepted, strategies can be developed to mitigate their impact and recover from them effectively. For example, redundant systems can be implemented to ensure that a single fault does not bring down the entire network.
Understanding normal operations (U) is the second axiom. This involves establishing a baseline of the network’s normal operational parameters and behaviors. With this baseline, deviations can be easily detected, indicating potential problems. Understanding normal operations also aids in designing efficient responses to faults and adverse events. For instance, traffic patterns can be monitored to identify typical usage trends. An unexpected surge in traffic could indicate a DDoS attack, prompting immediate remedial action.
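The baseline idea can be sketched in a few lines of Python. This is a minimal illustration, not part of ResiliNets itself: the traffic figures, the function names, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarise normal traffic as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical requests-per-second readings taken during normal operation.
normal_rps = [980, 1010, 995, 1005, 990, 1020, 1000, 985, 1015, 1000]
baseline = build_baseline(normal_rps)

print(is_anomalous(1012, baseline))  # within the usual range
print(is_anomalous(4500, baseline))  # a surge far outside the baseline
```

A real deployment would likely track several metrics and account for daily or weekly cycles, but the principle is the same: a deviation is only meaningful once normal behavior has been measured.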
The third axiom, Expecting adverse events (E), is about anticipating disruptions. While faults are inevitable, not all are predictable. This axiom underscores the importance of being prepared for unexpected events that could disrupt network operations. Various threat and challenge models can be used to anticipate potential adverse events, enabling proactive measures to be taken to defend against them. For example, network operators might employ threat modeling techniques to predict potential cyberattack vectors and implement appropriate security measures in advance.
The final axiom, Responding to adverse events and conditions (R), focuses on the need to react effectively when disruptions occur. This involves remediation to minimize the impact of the adverse event and recovery to return to normal operations. A well-defined and practiced response strategy is critical to minimize downtime and maintain service quality. For instance, a network could have automated failover systems in place that kick in when a server goes down, ensuring a seamless experience for the end users.
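As a rough sketch of the failover example, the following Python function picks the first server that passes a health check, in priority order. The server names, the health states, and the function itself are hypothetical; a production system would probe real endpoints and handle timeouts.

```python
from typing import Callable, Sequence

def first_healthy(servers: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first server that passes the health check, in priority order."""
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy server available")

# Hypothetical server names and health states, for illustration only.
status = {"primary": False, "standby-1": True, "standby-2": True}
servers = ["primary", "standby-1", "standby-2"]

active = first_healthy(servers, lambda name: status[name])
print(active)  # standby-1
```

Here the remediation step is the switch to a standby; recovery would be the later return to the primary once it is healthy again.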
Collectively, these axioms provide a comprehensive and pragmatic approach to network resilience. They recognize the inherent challenges of operating networks and provide a framework for anticipating, responding to, and recovering from disruptions. With the ever-increasing reliance on networks for business and personal use, the importance of these axioms and the resilience they facilitate can’t be overstated [1].
Figure 1 from: https://resilinets.org/
Real-Time and Background Control Loops
The ResiliNets approach establishes essential prerequisites for network resilience that include service requirements, normal behavior, threat and challenge models, metrics, and heterogeneity in mechanisms, trust, and policy. Understanding the normal behavior of the network is vital, as it provides a baseline for detecting when the network has deviated from its expected performance due to an adverse event. Metrics are crucial for monitoring the network’s performance and detecting any deviations.
Trade-offs are also an inherent part of the ResiliNets design. For instance, resource trade-offs might involve balancing the need for redundancy, which improves resilience but comes at a cost, against other requirements such as efficiency or cost-effectiveness. Similarly, complexity can enhance resilience by providing multiple pathways for communication, but it can also make the system more challenging to manage and increase the potential for errors. State management is another trade-off, with the need to maintain the state of the network to facilitate recovery after a disruption balanced against the storage and processing requirements of state information.
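The redundancy trade-off can be made concrete with the standard availability formula for parallel components, 1 - (1 - a)^n. The sketch below assumes that replicas fail independently, which real deployments rarely achieve in full, and the availability and cost figures are hypothetical.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant components, assuming independent failures."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1 - (1 - component_availability) ** replicas

# Hypothetical figures: each replica is 99% available and costs 10 units.
UNIT_COST = 10
for replicas in range(1, 5):
    availability = parallel_availability(0.99, replicas)
    print(f"replicas={replicas} cost={replicas * UNIT_COST} availability={availability:.8f}")
```

Cost grows linearly with the number of replicas while each extra replica buys a smaller absolute gain in availability, which is the trade-off in miniature. The independence assumption is also why redundancy alone is not enough: identical replicas can share a common fault.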
The ResiliNets approach identifies several key enablers of resilience. These include:
- Security and self-protection mechanisms, which defend against threats.
- Connectivity, which ensures the network remains operational.
- Redundancy, which provides backup resources in the event of a failure.
- Diversity, which reduces the likelihood of common mode failures.
- Multilevel strategies, which provide resilience at multiple layers of the network.
- Context awareness, which allows the system to adapt to changing conditions.
- Translucency, which provides visibility into the network’s operation while preserving essential security protections.
Figure 2 from: https://resilinets.org/
Operational Support Systems (OSS), as described by Ryan of Passionate About OSS, are an integral part of network management. OSS provide a suite of tools that manage and coordinate network resources, ensuring seamless communication. In the context of resilience, OSS are a critical enabler: they help monitor network health, diagnose faults, and facilitate recovery. The ResiliNets approach, with its focus on understanding normal operations and responding to adverse events, complements the role OSS play in maintaining network operations.
Quantifying Network Resilience in ResiliNets
The ResiliNets architecture provides a rigorous framework for quantifying network resilience. This framework focuses on two orthogonal dimensions of communication networks: the physical network characteristics, also known as the operational space (N), and the service requirements, or the service space (P).
The operational space represents the physical state of the network. Within this space, resilient networks aim to maintain normal operation in the face of challenges. This can be categorized into three levels:
- Normal operation according to network design and engineering
- Partially degraded but still operable
- Severely degraded, providing little or no operational capability.
Each level of operation has its distinct characteristics, and the goal is to prevent the network from moving from a state of normal operation to a state of severe degradation.
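One way to picture the three operational levels is to map a single metric onto them. The sketch below uses the fraction of working links; the metric and both thresholds are assumptions made for illustration, not values defined by ResiliNets.

```python
from enum import Enum

class OperationalState(Enum):
    NORMAL = "normal operation"
    PARTIALLY_DEGRADED = "partially degraded"
    SEVERELY_DEGRADED = "severely degraded"

def classify_operation(links_up: int, links_total: int,
                       normal_min: float = 0.95,
                       degraded_min: float = 0.60) -> OperationalState:
    """Map the fraction of working links onto one of the three levels.

    The thresholds are hypothetical; an operator would set them from the
    network's design and engineering targets.
    """
    if links_total <= 0:
        raise ValueError("links_total must be positive")
    fraction = links_up / links_total
    if fraction >= normal_min:
        return OperationalState.NORMAL
    if fraction >= degraded_min:
        return OperationalState.PARTIALLY_DEGRADED
    return OperationalState.SEVERELY_DEGRADED

print(classify_operation(98, 100).value)  # normal operation
print(classify_operation(70, 100).value)  # partially degraded
print(classify_operation(20, 100).value)  # severely degraded
```

In practice the operational state is described by many metrics at once, so a real classifier would combine several of them rather than rely on link counts alone.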
On the other hand, the service space represents the quality of service for an application over a given network. Here, resilient services aim to remain acceptable even when network operation degrades. Like the operational space, the service space can also be classified into three categories:
- Acceptable service with respect to service specification
- Impaired but usable service
- Unacceptable service that provides little or no utility
The objective here is to ensure that services remain within the acceptable range, even when facing adverse network conditions.
Resilience (R) in ResiliNets is defined as a function of state transition probability in this two-dimensional state space. Each dimension consists of a multivariate metric descriptor, and the network state (S) is a discrete set of operational metrics and service parameters. To limit the number of states, an aggregation is performed. The operational and service spaces are each divided into three regions, corresponding to the levels of operation and service quality mentioned earlier.
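To make the state-space picture more tangible, the sketch below treats each state as a pair (operational region, service region) in the 3 x 3 grid and estimates transition probabilities from a sequence of observed states. This is only an illustrative estimator under assumed data; it is not the ResiliNets resilience metric itself, and the trace is invented.

```python
from collections import Counter, defaultdict

# Region indices on each axis: 0 = best region, 2 = worst region.
# A state is an (operational_region, service_region) pair.

def transition_probabilities(trace):
    """Estimate P(next state | current state) from consecutive observations."""
    counts = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for state, successors in counts.items()
    }

# Hypothetical trace: the network degrades twice and recovers each time.
trace = [(0, 0), (0, 0), (1, 0), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
probs = transition_probabilities(trace)

for state, successors in probs.items():
    print(state, successors)
```

Under this view, a more resilient network is one where transitions toward the worse regions are rare and transitions back toward (0, 0) are likely.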
It’s important to note that these spaces are not static but dynamically change in response to various events and conditions. As such, ResiliNets’ approach to resilience involves not only maintaining operation within acceptable bounds but also taking proactive measures to adapt to changing conditions and recover from adverse events.
Conclusion
The ResiliNets architecture represents a comprehensive and systematic approach to understanding and enhancing network resilience. By establishing a set of guiding axioms, operationalizing them through control loops, and providing a framework for quantifying resilience, it offers valuable insights and tools for network administrators, researchers, and developers.
This approach is not only significant for its contribution to the field of network resilience but also for its potential impact on the broader realm of network management and design. The continued research and development in this area promise to further our understanding of network resilience and open up new possibilities for creating robust and reliable network systems.
Further Reading
For those interested in exploring this topic further, the works of Prof. James P.G. Sterbenz, Prof. David Hutchison, and Dr. Justin P. Rohrer provide an excellent starting point. Their extensive contributions in the field offer valuable insights into the intricacies of network resilience and the ResiliNets approach. Notably, their work on the ResiliNets architecture is comprehensively documented on the ResiliNets Wiki, which serves as a useful resource for both beginners and experts in the field.
In addition, Operational Support Systems (OSS) play a crucial role in network management and resilience. As Ryan, an enthusiast of OSS, aptly puts it, OSS are the “connectors and the profit engine behind any communications network”. His passion for OSS highlights the importance of these systems in ensuring the smooth operation of networks, further underscoring the necessity of network resilience in today’s digital age.
Network resilience is a vast and complex field, with many facets to explore. By building on the foundational work of pioneers like Sterbenz, Hutchison, and Rohrer, and by continually pushing the boundaries of our understanding, we can hope to create networks that are not only resilient in the face of challenges but also capable of adapting and evolving to meet the ever-changing demands of our interconnected world.
Acknowledgments
The authors would like to acknowledge the invaluable contributions of Prof. James P.G. Sterbenz, Prof. David Hutchison, and Dr. Justin P. Rohrer to the field of network resilience. Their pioneering work on the ResiliNets architecture has not only advanced our understanding of network resilience but also paved the way for future research and development in this field.
References
- Sterbenz, James P.G., Hutchison, David, Çetinkaya, Egemen K., Jabbar, Abdul, Rohrer, Justin P., Schöller, Marcus, and Smith, Paul. “Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines.” Computer Networks, 54(8), 1245-1265, 2010.
- Rohrer, Justin P., Sterbenz, James P.G., and Hutchison, David. “The ResiliNets Architecture.” ResiliNets Wiki. Available at: http://www.resilinets.org.
- “In memory of James P.G. Sterbenz (1956-2019).” IFIP News. Available at: https://www.ifipnews.org/in-memory-of-james-p-g-sterbenz-1956-2019.
- “Operational Support Systems (OSS).” Ryan is Passionate About OSS. Available at: http://www.ryanispassionateaboutoss.com.