Weathering the Storm: Redundancy is Key
By Patrick Hunter
As we continue to weather the coronavirus storm, it has become increasingly obvious to many of us that the availability of the network is as important now as ever. When we talk about ensuring that our remote users can reach the VPN infrastructure, or even the Internet for that matter, the topic of “high availability” comes to mind. Pondering that idea, this author realized that now might be a good time to answer the question: what does it mean for a service or system to be highly available? Time for a deeper dive into that idea.
Continuous availability, resilience of networks and systems, and the concepts of uptime and downtime have become their own fields of study in recent decades. Naturally, as the use of computers has grown, so has our reliance on their being available. There was a time when, as long as the computers and databases were functioning during the local workday, a system was considered to be available as needed. Over time, due to our increased reliance on these systems and the growing global demand for their availability, the measurement of a system’s ability to remain available and resilient has matured.
Let’s look at this concept in its simplest terms. A single computer providing a service that is deemed critically important needs to function twenty-four hours a day, every day of the year, to be considered 100% available. As soon as you read that sentence, your brain likely kicked in to consider the many other factors inherent in that claim. What about the software that runs on that computer? If the software experiences an issue that causes it to crash or behave in an unusable fashion, and that software performs the critical function for the user, does it matter much that the computer stayed powered on and responsive to user input? If the user interacts with the computer remotely, such as through a network connection, and that connection is impaired or severed, does it matter that the computer remained powered on and the software was functioning as designed? If the chair in which the user sits to physically interact with the computer experienced a physical issue and was no longer available to the user… okay, that’s a pretty extreme example, but it raises the question nonetheless: how do we measure availability? And, assuming we can even agree on the measurement, can we agree on the terms to feed that measurement?
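To make the measurement question concrete, here is a minimal sketch of the simplest common definition: availability is usable time divided by total observed time. This is my own illustration, not a formal standard; the function name and the eight-hour outage figure are hypothetical.

```python
# A minimal, illustrative sketch of the simplest availability measurement:
# the fraction of observed time during which the service was usable.

def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as a fraction of total observed time."""
    total = uptime_hours + downtime_hours
    return uptime_hours / total

# Hypothetical example: a service that was down for 8 hours over a 365-day year.
hours_in_year = 365 * 24          # 8,760 hours
down = 8.0
up = hours_in_year - down
print(f"Availability: {availability(up, down):.5%}")  # ~99.909%
```

Even this tiny example shows why agreeing on the terms matters: whether those eight hours count as “downtime” depends on whether we measure the computer, the software, or the user’s ability to reach it.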
Not surprisingly, there exists a wealth of information and scientific study on the subject. Those of us in the IP network world have our own views of availability and how to achieve it. We focus on the routers, switches, firewalls, and other components that help get the user to the computer. Those devices come in all shapes and sizes, and most offer features designed to provide redundancy in the event of a component failure. Depending on the design of the network, a failure of any one of those components (or of a connection between them) may leave an application unreachable or unusable. Therefore, our two biggest design questions are: How do we keep each device functioning all the time? And how do we keep the connections between devices available all the time?
Addressing the first question: for any single network device, keeping power flowing to the equipment is key. Redundant power supplies are pretty much a standard item on any enterprise- or carrier-class network device today. Of course, this also implies that redundancy of the power itself be considered, such as dual circuits feeding the device and perhaps other failsafe mechanisms, but I don’t want to digress here. Redundant processors within the network device are also necessary, because what good is redundant power if a CPU or route processor fails, right? Multiple supervisor cards and other components round out the failure-proofing features as well. But even with all of those features, sometimes the chassis itself simply stops functioning. It’s an inevitable reality in our world: it’s not a matter of if a device will fail, but when.
So, how do we address that potential? By having multiple devices perform the same function. Internet routing has had this concept in mind since its inception. The idea that routers could know the possible paths to a destination and always adapt to use the best available path was a key component of the design. Therefore, two routers that connect to both a source (the user) and a destination (the computer or application) can provide redundancy. There is also a method of high availability, often used in firewalls and load-balancing appliances (and sometimes even routers and switches), in which two physical devices act as a single unit. This configuration is commonly referred to as an “HA pair.” The two devices constantly communicate with each other to keep their information bases synchronized, specifically in preparation for a failure within one device or even an external failure, such as a lost network connection.
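As a rough illustration of why a pair of devices helps, here is a hedged sketch of the classic parallel-availability formula: if each device is independently available some fraction of the time, the pair is unavailable only when both fail at once. The independence assumption and the 99.9% figure are mine, chosen purely for illustration.

```python
# Sketch: availability of two redundant devices acting as one logical unit,
# assuming their failures are independent (a simplifying assumption).

def parallel_availability(a1: float, a2: float) -> float:
    """Probability that at least one of two devices is available."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

single = 0.999                              # one device: "three nines" (assumed)
pair = parallel_availability(single, single)
print(f"Single device: {single:.3%}")       # 99.900%
print(f"HA pair:       {pair:.5%}")         # 99.99990%
```

The dramatic jump is the whole point of an HA pair: the combined downtime shrinks to the (hopefully rare) moments when both devices are broken at the same time.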
We’ve actually already touched on the second question in the previous paragraph. In its most basic form, network path redundancy means ensuring that there are at least two paths from a source to a destination. But, as you might have suspected, it’s not that simple either: we must consider both the physical and the logical paths.
For example, a company can lease a circuit from a carrier to get from one place to another. There will be costs involved in constructing the circuit initially, but once the physical path is established, leasing a second circuit is often much more cost-effective. The most common example might be a fiber-based circuit from point A to point B. Adding circuit number two may require nothing more than logical provisioning or a second wavelength on the same pair of fibers. Bingo! Redundancy achieved; we have two “paths” from A to B. But a single cut to the physical fiber somewhere between A and B would still take down both circuits.
This leads to the discussion of diverse path redundancy in circuits. It doesn’t take much to realize that we are now proposing two physical builds instead of one, which increases costs. Following the logic to its conclusion, the greatest number of diverse paths between any points A and B provides the highest level of redundancy, but also at the highest cost. Time has proven that there is no way around this reality, as physics and geography have both refused to bend to the will of engineers for centuries. (For a rough sense of why physical diversity matters so much more than a second logical circuit, see the sketch below.)
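Here is a hedged sketch extending the same math as before: two circuits riding the same fiber share its fate (one cut takes out both), while truly diverse paths fail independently. The availability figures are assumptions chosen only for illustration, not measurements of any real circuit.

```python
# Sketch: end-to-end path availability, assuming independent failures.
# All numbers are illustrative assumptions.

def series(*availabilities: float) -> float:
    """All components must work: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(a1: float, a2: float) -> float:
    """At least one path must work."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

fiber = 0.999            # a single physical fiber route from A to B (assumed)
wavelength = 0.9999      # the equipment lighting each wavelength (assumed)

# Two wavelengths on the SAME fiber: the shared fiber still dominates.
same_fiber = series(fiber, parallel(wavelength, wavelength))

# Two wavelengths on DIVERSE fiber routes: each path fails independently.
diverse = parallel(series(fiber, wavelength), series(fiber, wavelength))

print(f"Same fiber, two circuits: {same_fiber:.4%}")  # ~99.9000%
print(f"Diverse fiber routes:     {diverse:.4%}")     # ~99.9999%
```

In this toy model, the second wavelength on the same fiber barely moves the needle, because the shared fiber remains the single point of failure; the diverse build is what buys the extra nines, which is exactly what the extra construction cost is paying for.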
My critical infrastructure and change management friends will note that I didn’t take the time to discuss the more specific measurements of high availability, like “five nines” (99.999% uptime) or mean time between failures (MTBF), which certainly apply here. Both for the sake of brevity and because of the notion I mentioned earlier (how to agree upon the measurement criteria), I have chosen to save those for another day.
Suffice it to say, when your leadership comes to you and says “This *stuff* can’t go down!”, make sure to consider all of the factors involved when designing an end-to-end solution so you don’t have to catch a bunch of…um…stuff from your boss.
Patrick Hunter — “Hunter”
Director, IT Enterprise Network and Telecom,
Charter Communications
hunter.hunter@charter.com
Hunter has been employed with Charter since 2000 and has held numerous positions, including Installer, System Technician, Technical Operations management, Sales Engineer, and Network Engineer. His responsibilities include providing IP connectivity to all users in Charter’s approximately 4,000 facilities, including executive and regional offices, technical centers, call centers, stores, headends, hubsites, and data centers. Mr. Hunter has served on the SCTE Gateway Chapter Board of Directors since 2005. He spends his spare time mentoring, teaching, and speaking on IP and Ethernet networks as well as careers in the network field.