Mass-Ping | Graph-Mass-Ping



Mass-Ping

Mass-Ping accepts as input either a text file listing nodes or a list of routes, identifies which devices are answering ICMP Echo requests, and then pings those devices at a specified interval (typically once per second) for a specified window. At the end of the run, it produces a CSV file recording the hits and misses for each device. The point is to record how a highly-available environment responds to failure: start mass-ping, reboot a redundant server, switch, or router, wait until both failover and failback have occurred, and then stop the run (mass-ping can be safely interrupted via Ctrl-C).
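The loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not mass-ping's actual code: the function names (ping_once, mass_ping), the CSV layout (a timestamp column followed by one 1/0 column per device), and the use of a Linux-style ping command with -c and -W are all assumptions made for the example. It also probes devices one after another, so a real run against many devices would need parallel probes to hold a one-second interval.

```python
import csv
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one ICMP echo via the system ping command; True if a reply came back.

    Assumes a Linux-style ping that accepts -c (count) and -W (timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def mass_ping(hosts, out, rounds, interval_s=1.0, probe=ping_once,
              clock=time.time, sleep=time.sleep):
    """Probe the responsive hosts once per interval and write a hit/miss CSV.

    The CSV layout (timestamp, then 1 = hit / 0 = miss per device) is an
    assumption for this sketch, not the real tool's format.
    """
    # Step 1: keep only the devices that answer at all.
    alive = [h for h in hosts if probe(h)]
    writer = csv.writer(out)
    writer.writerow(["timestamp"] + alive)
    try:
        # Step 2: one row of hits and misses per interval.
        for _ in range(rounds):
            started = clock()
            row = [1 if probe(h) else 0 for h in alive]
            writer.writerow([int(started)] + row)
            # Sleep off whatever remains of this interval.
            sleep(max(0.0, interval_s - (clock() - started)))
    except KeyboardInterrupt:
        # Ctrl-C ends the run early; rows already written are kept.
        pass
    return alive
```

The probe, clock, and sleep parameters exist only so the loop can be exercised without touching the network; in normal use they would be left at their defaults.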

Graph-Mass-Ping accepts as input the CSV file produced by mass-ping and produces a PNG file illustrating the output, to assist in visualizing the behavior of the highly-available devices.

Examples

In this example, I rebooted the 'a' switch in a data center, leaving the 'b' switch as the sole survivor. The CSV file records the raw data, while the PNG file illustrates which devices ignored the event (their 'b' side NICs kicked in appropriately) and which did not (e.g. elmo-storage, iron-b-svif2, isis-2-iscsi-1, and ja-a-rtr-v111042). The three pings missed by every device at ~20:28:45 track the time that EIGRP and HSRP, in the routed infrastructure separating the monitoring station from this data center, require to detect the loss of a switch and respond appropriately.

In this example, I rebooted the 'a' side of the redundant layer 3 core at our network border, while mass-ping watched various devices located outside our network. The three pings missed at the start track the time that EIGRP and HSRP require to detect the failure and respond appropriately. (The dozen or so missed pings for ga-vpn track the failover time that an intervening firewall requires before responding to the loss of the 'a' router.) This diagram also illustrates the relatively disruptive effect of failback: when the 'a' router finishes rebooting and reasserts its HSRP priority, the result is ~30 seconds of disrupted connectivity, followed several minutes later by varying disruptions to varying locations. This reflects some sort of unpredictability in how the IPsec encryption engines terminating the site-to-site VPN tunnels respond to the shifts in HSRP dominance.
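Readings such as "three missed pings" or "~30 seconds of disrupted connectivity" come down to counting consecutive misses per device and multiplying by the ping interval. The sketch below shows one way to pull that number out of a CSV file. It is an illustration only: the function name longest_outages and the CSV layout it expects (a timestamp column, then one column per device holding 1 for a hit and 0 for a miss) are assumptions, and the real mass-ping output may be laid out differently.

```python
import csv
import io


def longest_outages(csv_file, interval_s=1.0):
    """Return {device: seconds} for each device's longest run of consecutive misses."""
    reader = csv.reader(csv_file)
    header = next(reader)
    devices = header[1:]              # column 0 is assumed to be the timestamp
    current = dict.fromkeys(devices, 0)
    longest = dict.fromkeys(devices, 0)
    for row in reader:
        for device, cell in zip(devices, row[1:]):
            if cell == "0":           # "0" is assumed to mark a missed ping
                current[device] += 1
                longest[device] = max(longest[device], current[device])
            else:
                current[device] = 0   # a hit ends the current run of misses
    return {d: n * interval_s for d, n in longest.items()}


if __name__ == "__main__":
    # Made-up sample data with hypothetical device names, not output from a real run.
    sample = io.StringIO(
        "timestamp,dev-1,dev-2\n"
        "1,1,1\n"
        "2,0,1\n"
        "3,0,0\n"
        "4,0,1\n"
        "5,1,1\n"
    )
    for device, seconds in longest_outages(sample).items():
        print(device, seconds)
```

Taking the longest run rather than the total number of misses keeps the failover and failback disruptions from being added together, so the figure approximates the worst single outage each device saw.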


Prepared by:
Stuart Kendrick

Last modified: 24-October-2009