Mass-Ping | Graph-Mass-Ping



Mass-Ping

Mass-Ping accepts as input a text file listing nodes or a list of routes. It identifies which devices are returning ICMP Echo Replies, then pings that list for a specified window at a specified interval (typically once per second). At the end, it produces a CSV file recording the hits and misses for each device. The point is to record how a highly available environment responds to failure: start mass-ping, reboot a redundant server, switch, or router, wait until both failover and failback have occurred, and then stop the run (mass-ping can be safely interrupted via Ctrl-C).
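The ping-and-record loop described above can be sketched roughly as follows. This is a minimal Python illustration, not mass-ping's actual Perl code: the `probe` callable, the function names, and the CSV layout (a header row, then one row per round with 1 for a hit and 0 for a miss) are all assumptions made for the example.

```python
import csv
import io
import time


def find_responders(targets, probe):
    """Keep only the targets that answer an initial probe."""
    return [t for t in targets if probe(t)]


def run_rounds(targets, probe, rounds, interval=1.0, sleep=time.sleep):
    """Probe every target once per round; return one hit/miss row per round.

    `probe(target)` is a hypothetical callable that returns True when the
    target answered. A real tool would send an ICMP Echo Request here.
    """
    rows = []
    try:
        for round_no in range(rounds):
            rows.append([1 if probe(t) else 0 for t in targets])
            if round_no < rounds - 1:
                sleep(interval)
    except KeyboardInterrupt:
        # Mirror the Ctrl-C behaviour: stop early, keep what was gathered.
        pass
    return rows


def to_csv(targets, rows):
    """Render the rows as CSV text (assumed layout, see the lead-in)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["round"] + list(targets))
    for round_no, row in enumerate(rows):
        writer.writerow([round_no] + row)
    return out.getvalue()
```

Passing a fake `probe` and a no-op `sleep` lets you exercise the loop without sending any traffic.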

During a run, mass-ping reports how many devices returned a ping each second, which gives you real-time feedback on the impact of whatever change you are making. In this example, something happened partway through the run: the number of stations returning pings dropped from 162 to ~144 for a minute or so. When it finishes, mass-ping produces a report summarizing what it saw.
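The per-second figure is simply the number of hits in each round. As a rough Python sketch (the row layout, one list of 1/0 values per round, and both function names are assumptions for illustration, not mass-ping internals):

```python
def responders_per_round(rows):
    """Count how many devices answered in each round."""
    return [sum(row) for row in rows]


def degraded_rounds(counts):
    """Return the indices of rounds in which fewer devices answered
    than in the best round of the run."""
    if not counts:
        return []
    best = max(counts)
    return [i for i, count in enumerate(counts) if count < best]
```

A dip like the one from 162 to ~144 would show up as a contiguous stretch of indices in the result of `degraded_rounds`.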

Graph-Mass-Ping accepts as input the CSV file produced by mass-ping and produces a PNG file illustrating the output, to assist in visualizing the behavior of the highly-available devices.
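To give a feel for what such a picture conveys, here is a small Python sketch that renders the same kind of data as text, one line per device and one character per ping. It is a stand-in for illustration only: graph-mass-ping produces a PNG, and the function name, the row layout (one list of 1/0 values per round), and the `#`/`.` characters are my assumptions.

```python
def text_timeline(names, rows, hit="#", miss="."):
    """Render one line per device: '#' for a returned ping, '.' for a miss."""
    width = max((len(name) for name in names), default=0)
    lines = []
    # zip(*rows) turns per-round rows into per-device columns.
    for name, column in zip(names, zip(*rows)):
        marks = "".join(hit if sample else miss for sample in column)
        lines.append(f"{name.ljust(width)} {marks}")
    return "\n".join(lines)
```

Devices that ride through a failure show an unbroken line, while devices that lose connectivity show a visible gap.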

Examples

In this example, I rebooted the 'a' switch in a data center, leaving the 'b' switch as the sole survivor. The CSV file records the raw data, while the PNG file illustrates which devices ignored the event (their 'b' side NICs kicked in appropriately) and which did not (e.g. elmo-storage, iron-b-svif2, isis-2-iscsi-1, and ja-a-rtr-v111042). The three pings missed by every device at ~20:28:45 track the time that EIGRP and HSRP, in the routed infrastructure separating the monitoring station from this data center, require to detect the loss of a switch and to respond appropriately.
This is a similar example, showing the reboot of the 'a' switch and then of the 'b' switch.
In this situation, I rebooted the 'a' side of the redundant layer 3 core in our network border, while mass-ping watched various devices located outside our network. The initial three missed pings track the time that EIGRP and HSRP require to detect the failure and to respond appropriately. (The dozen or so missed pings for ga-vpn track the failover time that an intervening firewall requires before responding to the loss of the 'a' router.) This diagram also illustrates the relatively disruptive effect of failback: when the 'a' router finishes rebooting and reasserts its HSRP priority, the result is ~30 seconds of disrupted connectivity, followed several minutes later by varying disruptions to varying locations. I speculate that this reflects some unpredictability in how the IPSec encryption engines, which terminate site-to-site VPN tunnels, respond to the shifts in HSRP dominance.
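Figures such as "three missed pings" or "a dozen or so missed pings" are runs of consecutive misses, and they can be pulled out of the raw data rather than read off the picture. A minimal Python sketch, assuming the data has already been loaded as one list of 1/0 values per round (the layout and the function names are hypothetical, not part of mass-ping):

```python
def longest_outage(samples):
    """Return the length of the longest run of consecutive misses."""
    longest = current = 0
    for hit in samples:
        if hit:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def outages_by_device(names, rows):
    """Map each device name to its longest run of missed pings."""
    columns = zip(*rows)  # per-round rows -> per-device columns
    return {name: longest_outage(col) for name, col in zip(names, columns)}
```

At the typical interval of one ping per second, the run length is also the approximate outage in seconds.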

Mechanics

Install

Installing mass-ping is not trivial, because it relies on numerous Perl modules that are not included in Perl's core distribution. In addition, it runs only under *nix; I have not yet invested the effort to make it portable to Windows.

Options

Here is what the help output of each script looks like once both are installed:

guru$ mass-ping --help
mass-ping v1.1.6

  Usage:  mass-ping -s {yes|no} [-c {"title"}] [-d {integer}] [-i {interval in seconds}] 
          [-t {timeout in seconds}] [-w {window in seconds}] [-m {report directory}] 
          [-n {report prefix}] [-o {owner}] [-g {group}] [-q {route,route,route...} | 
          -f {filename} | target1 target2 target3 ...] 

    -c specifies the title (enclose multi-word titles in quotes) which 
       will be included in the data file

    -d specifies debug level (typically 0-9)

    -f points the script to a file containing the list of targets

    -i specifies the interval between rounds of pings; default is 1 seconds

    -m specifies the directory into which we will dump the report file and the 
       date file.  The default is /home/netops/rpts/mass-ping
       It must already exist; we will not create it for you

    -n specifies the report prefix, a string prepending to the name of the 
       data file.  The default is empty

    -p defines the level of parallelism, a POE::Component::Client::Ping 
       parameter for the number of outstanding pings permitted.  The default 
       is 30

    -q specifies a list of routes in CIDR form (e.g. 10.1.2.0/23,10.2.20.0/23,10.15.30.0/23); 
       we will ping all host addresses on these routes

    -s asks the question:  are you serious?  If you answer 'no', then 
       the script will run in demo mode, making no changes

    -t specifies the time to wait before considering a ping lost; default 
       is 0.2 seconds

    -w specifies the window during which we will ping, i.e. the length of time 
       to elapse between starting and stopping.  The operator can interrupt 
       the program using Ctrl-C, whereupon the script will produce its report
       files with whatever data it has gathered up to that point.  The default 
       is 600 seconds

    target1 target2 target3 ... is the command-line list of targets you
      have specified, in lieu of the -a, -e, or -f options

guru$
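The -q option expands each route into its host addresses. The following Python sketch shows the same idea using the standard `ipaddress` module. It is not how mass-ping does it internally, and I am assuming that "host addresses" excludes the network and broadcast addresses, which is what `hosts()` returns for these prefix lengths; the function name is made up for the example.

```python
import ipaddress


def expand_routes(route_list):
    """Expand a comma-separated list of CIDR routes into host addresses."""
    targets = []
    for route in route_list.split(","):
        # strict=True rejects routes with host bits set, e.g. 10.1.3.0/23.
        network = ipaddress.ip_network(route.strip(), strict=True)
        targets.extend(str(host) for host in network.hosts())
    return targets
```

Each /23 contributes 510 addresses under this assumption, so a few routes quickly add up to a long target list.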


guru$ graph-mass-ping --help
graph-mass-ping v1.1.2

  Usage:  graph-mass-ping -f {filename or directory name} [-d {debug level}] [-i {stamp interval}] [-n] [-t]

    -d debug level; 0 is default

    -f specifies an mass-ping data file or a directory containing 
       mass-ping data files.  Data files must end in '.csv'

    -i specifies how frequently we will engrave a time stamp.  30 seconds 
       is default

    -n tells us to (attempt) to resolve IP addresses in the data file first 
       to FQDN and, if that fails, then to host names

    -t tells us to sort the data file by node name

guru$


Last modified: 2017-04-28