Who is Watching the Watcher?
By Patrick Hunter
Monitoring our devices, systems, applications, and other critical pieces of our networks has long been an important part of keeping everything running smoothly and, hopefully, without impact to the customer. Depending on the context, the “customer” could be a paying customer who enjoys reliable, high-speed access to the Internet or other locations in their footprint, or one who enjoys phone or video services. Additionally, for many of us, our customer may in fact be an employee at the company for which we work. In all cases, ensuring the reliability of our systems has always been of the utmost importance. In today’s environment, all of these services are more important than at any other time in our industry’s history.
But it has long been a challenge to ensure that the right type of monitoring, or telemetry, is available, sufficient in depth, and functioning reliably and properly. That raises an interesting question: how important are the systems that watch all of our systems? It is an interesting idea when one ruminates on it. Common sense seems to dictate that the monitoring systems are among the most important in our environments. But there is another question that often goes unanswered, or is answered only partly and mostly by implication, even in the most modern of network and service provider environments. That question is: who is truly accountable for ensuring that monitoring and telemetry are working the way they are supposed to work?
Now, many readers likely have that “duh” sort of expression as they read this. “Yeah, um…it’s the job of the ‘telemetry/monitoring’ team to do that.” Turns out, it’s not as simple as that in most cases. In most environments, with perhaps the exception of very small organizations or networks, there is a dedicated team of engineers whose focus is strictly on Simple Network Management Protocol (SNMP) and the variety of tools that make monitoring possible. In some cases, the team may only be one or two people; in others, it may be much larger. This also raises a different question: is the team that ensures that monitoring is functioning properly also the team that actually does the monitoring? Again, the question has likely prompted the “are you a dummy?” sort of expression from the reader. In most cases, a network operations center (NOC) or similar team is responsible for watching for all of the alerts and important telemetry-related information that come from the monitoring tools and systems. In some very unusual or small cases, the teams may be the same or very closely linked together. In others, perhaps not so much. But in any of these cases, we’re still back to our original question: whose job is it to ensure that the alerts are being generated and getting where they are supposed to go?
Some readers likely think that it’s the responsibility of the “telemetry” team to ensure the alerts are being generated, sent to the appropriate places, and received there. Others may believe that the NOC or other similar team is responsible. In reality, neither of those is the best answer. The author realizes that this position may generate some ire from a few colleagues, but it bears spelling out nonetheless. The system works best when the owner of the actual device, service, or application takes responsibility for ensuring that said device, service, or application is monitored in a fashion that allows the right teams to see the alert, understand the nature of the alert, and perform triage or escalation in a manner that best meets the needs of the business.
While the telemetry team should be considered the expert in telemetry, SNMP, and similar technologies, it isn’t the owner of the service. The same goes for a NOC or other like-oriented team. These teams simply perform their monitoring and triage (and possibly troubleshooting) duties on behalf of the device or system owner. No other team is better positioned to understand the application, network, or service than the actual owner. That puts the accountability squarely on the shoulders of the owner, and not on the teams that help to facilitate monitoring on behalf of the owner.
This notion is not uncommon in our business, but it would represent a significant culture change for many of our organizations. It is a worthwhile exercise to take a hard look at the nature of monitoring and ensure that all roles, both accountable and responsible, are laid out clearly for everyone to understand. That’s the best way to ensure everyone is watching the important things that need to be watched, and that no one is assuming “the other guy” is the one doing the watching.
Patrick Hunter — “Hunter”
Senior Director, IT Systems Operation Center – Network/Security/Transport,
Hunter has been employed with Charter since 2000 and has held numerous positions, from Installer, System Technician, Technical Operations management, Sales Engineer, to Network Engineer. His responsibilities include providing IP connectivity to all users in Charter’s approximately 4,000 facilities, including executive and regional offices, technical centers, call centers, stores, headends, hubsites, and data centers. Hunter has served on the SCTE Gateway Chapter Board of Directors since 2005. He spends his spare time mentoring, teaching, and speaking on IP and Ethernet networks as well as careers in the network field.