How Packets Work

Throughput, latency, jitter, frame loss, and other oddities on the wire.

2016 Reporting on pcaps: These scripts crawl pcaps, reporting on what they find using text file summaries and charts which map frequency against time.

2015 Performance and Jumbo Frames describes my understanding of how jumbo frames can enhance or degrade performance.

2015 First Encounters with the ProfiTap-1G describes my first experiences using a low-cost way to capture in-line: this tap forwards frames to your PC using USB.

2012 Network IO Affects Host CPU quantifies the effect of line rate traffic on host CPU, with and without VLAN tagging. Illustrative packet traces.

2011 TCP Loses its Marbles illustrates one of the ways in which a TCP stack can become confused. In this situation, two boxes are trading HL7 messages (HL7 is a protocol used to exchange information about patients -- information like Admit/Discharge/Transfer timing and lab results). After a while, one box loses track of where it is in the conversation, emitting unbelievable numbers for TCP ACK and TCP SEQ.

2010 Predictable SSH Session Shutdown identifies which conversationalist, the client or the server, initiates the disconnect of a TCP session. In this situation, the client ('angie') ssh's to a server ('ingress') and types away. Sixty minutes later, the session disconnects, sometimes while the user is typing. The user replicated the experience three times in this trace and reported consistent behavior across the last several weeks.

What is causing the predictable session termination: the client, the server, or some device in between? In frame #3108, the client sends a TCP FIN to the server; the server acknowledges in frame #3109, and the client responds to the acknowledgement with a RST in frame #3110. I would spin the following story: the client SSH application is maintaining a timer which expires at sixty minutes; the client initiates the disconnect (#3108). The server's TCP stack acknowledges (#3109) receipt of the FIN and would normally proceed to send its own FIN ... but the client has already removed this connection from its tables and reacts with a RST (#3110). The server tears down its side of the TCP conversation (inferred: no packets accompany this step).
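The reasoning above can be reduced to a minimal Python sketch: walk the frames in order and report who sends the first FIN or RST. The Frame type and the single-flag sets are hypothetical simplifications; only the frame numbers, host names, and flags come from the narrative.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    number: int        # frame number in the trace
    sender: str        # host which emitted the frame
    flags: frozenset   # TCP flags, simplified to the ones named in the text


def teardown_initiator(frames) -> Optional[Frame]:
    """Return the first frame carrying FIN or RST, or None if there is none."""
    for frame in sorted(frames, key=lambda f: f.number):
        if frame.flags & {"FIN", "RST"}:
            return frame
    return None


# The three frames described above: FIN, ACK, RST.
frames = [
    Frame(3108, "angie", frozenset({"FIN"})),
    Frame(3109, "ingress", frozenset({"ACK"})),
    Frame(3110, "angie", frozenset({"RST"})),
]

first = teardown_initiator(frames)
print(f"Frame #{first.number}: {first.sender} starts the teardown")
```

This only tells you which address the teardown appears to come from; whether that frame is genuine is a separate question.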

Alternatively, some intermediate device could be spoofing, minimally, the FIN. However, I see that the IP Ident number in frame #3108 is consistent with the previous frame (#3105), as is the TCP Window size and the IP TTL -- yes, an intermediate device might have tracked all three of these parameters and spoofed them as well, but I have yet to see a device behave in such a sophisticated way, so I discount this possibility. Plus, I happen to know that the sniffer was plugged into a SPAN port on the switch servicing the client, which leaves little room for an intermediate device to play in this game. There remains the possibility of a man-in-the-middle attack, which this trace cannot entirely discount.
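A rough sketch of that consistency check in Python, assuming the sender's stack increments IP Ident by a small amount per packet and that TTL and Window stay put between neighbouring frames (true in this trace, not true in general). The field values below are invented for illustration; the trace's actual values are not reproduced here.

```python
def consistent_with_sender(previous, suspect, max_ident_gap=64):
    """Compare a suspect frame against the sender's previous frame.

    max_ident_gap is an arbitrary, hypothetical tolerance for how far
    IP Ident may advance between the two frames.
    """
    # IP Ident is a 16-bit counter, so allow for wraparound.
    ident_gap = (suspect["ip_id"] - previous["ip_id"]) % 65536
    return (
        0 < ident_gap <= max_ident_gap
        and suspect["ttl"] == previous["ttl"]
        and suspect["window"] == previous["window"]
    )


# Invented values standing in for frames #3105 and #3108.
frame_3105 = {"ip_id": 41200, "ttl": 128, "window": 64240}
frame_3108 = {"ip_id": 41201, "ttl": 128, "window": 64240}

# Invented example of what a clumsy spoofed FIN might look like.
clumsy_spoof = {"ip_id": 9000, "ttl": 57, "window": 8192}

print(consistent_with_sender(frame_3105, frame_3108))
print(consistent_with_sender(frame_3105, clumsy_spoof))
```

A pass is weak evidence rather than proof: as noted above, a sufficiently careful device could track and reproduce all three fields.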

Examining the client SSH application, I could see that it supported a configuration setting for automatically disconnecting the session after a specified number of seconds; however, the specified number was '0', which suggested to me that this feature was disabled. At that point, I wanted to install the application's service pack; my more experienced colleague suggested a simple uninstall/re-install, and this in fact fixed the issue: after uninstalling the application and re-installing it, the user spent the next day logged in without a single disconnect.

2010 MOSAIQ Stumbles on Dual Connections demonstrates a range of techniques: collaborating with domain-specific colleagues who provided key inputs (kudos to Josh, Matt, Stefani, and Robert), using unique strings to sync log output and packet traces, diagramming the environment, sketching how the various applications communicate, consulting with vendor tech support, and finally pulling together the story.

MOSAIQ is an application which controls the delivery of radiation to a tumor. The patient lies in a concrete and lead-shielded vault -- the door is nearly two feet thick. A therapist uses this application to control an electron beam which zaps the tumor. The application runs within a Citrix environment, hosted at a remote hospital, and consults numerous related applications. For example, some of the supporting applications contain three-dimensional images of the patient. Another contains the treatment plan developed by the medical staff. Yet another talks to the linear accelerator producing the beam, directing its movements.

In this environment, the application was occasionally hanging -- sometimes a few times a day, sometimes a few times a week, sometimes not at all. When the application hung, the linear accelerator shut down -- a fail-safe mechanism triggered by the loss of the controlling application. The therapist must restart the application, reposition the beam equipment, and then pick up treatment where s/he had left off -- not a dangerous procedure, but a tedious one, particularly for the patient, who has to continue lying there in the vault.

Trouble-shooting this plus related issues spanned a year of my time (several years of the team's time, prior to my joining the group). In the end, we developed a model in which a human disconnected from one Citrix session and started another. The new session should have failed because a key component, called ClientVMI, can only listen to one MOSAIQ instance at a time: ClientVMI's developers believed that they had coded it such that it would refuse subsequent connection attempts. ClientVMI controls the movements of the electron beam; only one therapist can drive it at a time. However, under some circumstances, ClientVMI will accept a second session. After a while, the orphaned instance of MOSAIQ disconnects its session with ClientVMI, whereupon ClientVMI disconnects the active MOSAIQ session. The active instance tries repeatedly to restart the session, but ClientVMI rejects these efforts. Eventually, MOSAIQ gives up and hangs. Restarting ClientVMI plus MOSAIQ resolves the issue. The long term fix? Reporting the issue to the developers and then installing the 'fat' client (removing Citrix from the mix), thus limiting ClientVMI to a single controlling MOSAIQ instance.

2010 Using Netstat. This Windows 2003 file server mounted its storage via iSCSI. Every few days, it logged messages like:

iScsiPrt Error The initiator could not send an iSCSI PDU
iScsiPrt Error The connection to the target was lost
iScsiPrt Error Target did not respond in time for a SCSI request

A few hours after a reboot, and shortly after another spate of iSCSI errors, we captured the following output:

C:\temp>netstat -e
Interface Statistics

                           Received            Sent

Bytes                     518260256      3535891165
Unicast packets           157057201        63057237
Non-unicast packets         2061365            2718
Discards                          0               0
Errors                            0            8482
Unknown protocols           2895300

8482 Ethernet errors isn't a lot, particularly when expressed as a percentage of the total Sent frames (8482 / 63057237 =~ 10^-4). But, what if those ~8000 bad frames were emitted in a burst? Perhaps ~8000 bad frames in a row could interrupt communication enough to trigger SCSI timeouts. We replaced the NICs (Broadcom) with Intel ones; no more Ethernet errors, and no more iSCSI errors. (If that hadn't worked, we would have swapped cables and then Ethernet ports.)
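For anyone wanting to repeat the arithmetic, here is a minimal Python sketch which parses the counter rows from the netstat -e output above and computes the ratio of Sent errors to Sent unicast packets. The parsing function and its regular expression are my own illustration, assuming the column layout shown above; they are not part of any Windows tooling.

```python
import re

# Counter rows copied from the netstat -e output above.
NETSTAT_E = """\
Bytes                     518260256      3535891165
Unicast packets           157057201        63057237
Non-unicast packets         2061365            2718
Discards                          0               0
Errors                            0            8482
Unknown protocols           2895300
"""


def parse_netstat_e(text):
    """Return {counter name: (received, sent)}; sent is None when absent."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^(\D+?)\s+(\d+)(?:\s+(\d+))?\s*$", line)
        if match:
            name, received, sent = match.groups()
            counters[name.strip()] = (int(received), int(sent) if sent else None)
    return counters


counters = parse_netstat_e(NETSTAT_E)
sent_errors = counters["Errors"][1]
sent_unicast = counters["Unicast packets"][1]
error_ratio = sent_errors / sent_unicast
print(f"Sent errors / Sent unicast packets = {error_ratio:.2e}")
```

The ratio is an average since boot; it says nothing about whether the errors arrived evenly or in a burst, which is the question that mattered here.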

2010 Using Netstat. This XP station runs software controlling an Illumina Sequencer, an instrument which generates a terabyte of data every few days, spread across numerous files, which a background job gradually spools to a file server. The spooling process was falling behind the data generation process, which meant that the local disk would fill and thus halt the run, degrading data quality and wasting thousands of dollars of reagents.

C:\temp>netstat -s
IPv4 Statistics
[...]
  Received Packets Discarded         = 2
[...]
  Discarded Output Packets           = 0
[...]
[...]
TCP Statistics for IPv4
[...]
  Segments Received                  = 820839121
  Segments Sent                      = 296809969
  Segments Retransmitted             = 1500299028
[...]
C:\temp>

Notice the Segments Retransmitted counter, the extraordinary ratio of retransmitted TCP frames when compared to frames sent ... on average, this TCP stack had to send a frame five (5) times for every successful delivery. We suspect that the NIC suffered 'bursts' of frame creation problems, attempting and failing to create the TCP frame hundreds or perhaps thousands of times in a row for a single frame, before returning to normal behavior. A packet trace showed numerous periods during which the client fell silent for several tenths of a second. We replaced the NIC (Broadcom) with an Intel; the Segments Retransmitted counter dropped dramatically (but not to zero -- turns out the server receiving the data stumbles regularly). And file transfer performance now exceeds the rate at which data is generated, so local storage no longer overflows.
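The 'five times' figure comes straight from dividing the two counters. A minimal Python sketch, assuming the 'name = value' layout of the netstat -s output above (the parsing helper is my own illustration):

```python
import re

# TCP counters copied from the netstat -s output above.
NETSTAT_S = """\
TCP Statistics for IPv4
[...]
  Segments Received = 820839121
  Segments Sent = 296809969
  Segments Retransmitted = 1500299028
"""


def parse_counters(text):
    """Return {counter name: value} for every 'name = integer' line."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*=\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


counters = parse_counters(NETSTAT_S)
ratio = counters["Segments Retransmitted"] / counters["Segments Sent"]
print(f"Retransmitted segments per segment sent: {ratio:.2f}")
```

On a healthy station I would expect this ratio to sit far below one, so a value in the neighbourhood of five stands out immediately.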

2009 ARP Poisoning illustrates the classic vulnerability which hosts have to the ARP protocol. The problem presented itself as intermittent file transfer failures, with some hosts affected and others not. In this example, 42.102 (client) successfully copies bytes to 43.150 (server) for a while but then encounters trouble: here, in frame #260, the client realizes that the server hasn't acknowledged the bytes in frame #250 and therefore retransmits. As the trace progresses, the server remains silent, not acknowledging the data in frame #250 nor responding to pings (a separate process on the client was emitting pings in parallel with the file copy). Filtering on this TCP stream, we see that the server picks up the conversation in frame #466, acknowledging the byte stream through frame #264 ... but over a hundred seconds have elapsed since its last transmission, and the client has given up and removed this TCP conversation from its tables. The client replies to frame #466 with a TCP RST in frame #467.

Why did the server go out to lunch for over a hundred seconds? Logging into the server and dumping its ARP cache across time, we can see a problem:

server> arp -a
aggr2          00:06:5b:fe:a0:e8
aggr1          00:06:5b:fe:a0:e7
server> arp -a
aggr2          00:06:5b:fe:a0:e8
aggr1          00:18:8b:30:bb:a8

Sometimes, the server correctly lists the client's MAC addresses ... but sometimes not. [The client contains two highly-available NICs, configured in an active/active fashion.]
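Spotting the change by eye works for two entries; for a larger cache, a small diff helps. A minimal Python sketch, assuming each snapshot has been reduced to 'name MAC' pairs as shown above (real arp -a output varies by platform, so the parsing here is illustrative only):

```python
# The two snapshots shown above.
SNAPSHOT_1 = """\
aggr2          00:06:5b:fe:a0:e8
aggr1          00:06:5b:fe:a0:e7
"""

SNAPSHOT_2 = """\
aggr2          00:06:5b:fe:a0:e8
aggr1          00:18:8b:30:bb:a8
"""


def parse_arp(text):
    """Return {name: MAC} from lines of the form 'name MAC'."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, mac = fields
            table[name] = mac.lower()
    return table


def changed_entries(before, after):
    """Return {name: (old MAC, new MAC)} for entries whose MAC changed."""
    return {
        name: (before[name], after[name])
        for name in before.keys() & after.keys()
        if before[name] != after[name]
    }


changes = changed_entries(parse_arp(SNAPSHOT_1), parse_arp(SNAPSHOT_2))
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```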

Turning the sniffer toward the station which owns 00:18:8b:30:bb:a8, we see ARP Salad: notice the many IP addresses listed in the 'Tell a.b.c.d' sections.

What was going on? It turns out that some versions (3.x) of Broadcom drivers, when configured in an active/active team, start emitting this kaleidoscope of unicast ARP Requests, claiming many IP addresses: that's what 00:18:8b:30:bb:a8 was doing. Per RFC 826, the receiving stations glean MAC Address <==> IP Address mappings from these requests, poisoning their ARP caches. All along, the server was in fact responding to the client ... but at times addressing those frames to the wrong MAC addresses. Upgrading the NIC drivers on these rogue stations (we had a handful of them) fixed the problem.
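One way to hunt for such rogue stations is to tally, per source MAC, how many distinct sender IP addresses its ARP Requests claim. A minimal Python sketch, assuming the requests have already been extracted from a trace as (sender MAC, sender IP) pairs. The IP addresses below are invented documentation addresses, and the one-IP threshold is an assumption: routers and multi-homed hosts can legitimately own several.

```python
from collections import defaultdict


def macs_claiming_many_ips(arp_requests, max_ips=1):
    """Return {MAC: sorted IPs} for MACs claiming more than max_ips addresses."""
    claimed = defaultdict(set)
    for sender_mac, sender_ip in arp_requests:
        claimed[sender_mac].add(sender_ip)
    return {
        mac: sorted(ips)
        for mac, ips in claimed.items()
        if len(ips) > max_ips
    }


# Hypothetical extract: the MAC addresses appear above, the IPs are invented.
arp_requests = [
    ("00:18:8b:30:bb:a8", "192.0.2.10"),
    ("00:18:8b:30:bb:a8", "192.0.2.11"),
    ("00:18:8b:30:bb:a8", "192.0.2.12"),
    ("00:06:5b:fe:a0:e8", "192.0.2.20"),
]

for mac, ips in macs_claiming_many_ips(arp_requests).items():
    print(f"{mac} claims {len(ips)} addresses: {', '.join(ips)}")
```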

2009 iSCSI Performance illustrates a class of issues popular in my environment. Essentially, the issue starts off looking complex and ends up being a case of not enough spindles. In this one, a DR Exchange server copied databases over the wire via iSCSI nightly to a local volume, then performed database validation -- and the process took all night plus much of the following day. Dividing the size of the databases to be copied by the time required gave an average throughput in the 10-12 MB/s range; Perfmon graphs supported this. The iSCSI target was hosted on a NetApp with a demonstrated ability to deliver wirespeed (~115MB/s, aka 1Gbps) throughput on random reads; the iSCSI client was a high-end PC equipped with two pairs of 1GigE NICs, one pair in the same (private) IP space as the NetApp, the other pair in the commodity IP space. We were expecting significantly greater throughput, given the high-end source storage, the all GigE network, and the high-end client.

The common elements I see in these cases are: (a) complex installation (multi-homed host, multiple protocols, multiple applications, multiple vendors), (b) novel technologies (iSCSI, Exchange, multi-homing) drawing attention away from the core issue. Packet traces illustrated normal behavior as well as pauses which, when summed together, accounted for considerable delay; on one particular trace containing ~226,000 frames running for ~33 seconds, 45 of those frames (SCSI Read commands issued from the client) appeared only after a pause of ~.3 seconds, i.e. more than a third of the trace time was consumed by waiting for the client to request more data from the NetApp. Surveying the traces uncovered no sign of TCP Retransmits nor of other pathology. (The 'TCP Incorrect Checksum' warnings arose because the trace was taken using Wireshark hosted on the client. Traces taken on the NetApp side showed no pathology either.) Graphing TCP Sequence numbers showed the expected plateau in sequence number growth during the slow periods, while graphing TCP Window Size showed no indication that window size exhaustion was slowing the exchange; and graphing round-trip latency showed no significant effect from server or network delays. Finally, graphing disk latency revealed that the receiving volume on the client was struggling to keep up with writes, steadily experiencing write latency in the ~150ms range. The receiving volume consisted of eight Seagate Cheetah ST3146854LC drives (15K SCSI) configured in a RAID-10 volume.
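The 'more than a third' figure is 45 pauses of roughly 0.3 seconds set against a ~33 second trace. The Python sketch below redoes that arithmetic and shows one way to pull such pauses out of a list of frame timestamps. The sample timestamps and the 0.25 second threshold are invented for illustration; they are not taken from the trace.

```python
def stalls(timestamps, threshold):
    """Return the inter-frame gaps (in seconds) which meet the threshold."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return [gap for gap in gaps if gap >= threshold]


# Figures quoted above: 45 pauses of ~0.3 s in a ~33 s trace.
pauses = 45
pause_seconds = 0.3
trace_seconds = 33.0
stalled_fraction = pauses * pause_seconds / trace_seconds
print(f"Fraction of trace spent waiting: {stalled_fraction:.0%}")

# Invented timestamps (seconds) containing two long silences.
sample_timestamps = [0.000, 0.001, 0.002, 0.302, 0.303, 0.603]
sample_stalls = stalls(sample_timestamps, threshold=0.25)
print(f"{len(sample_stalls)} stalls totalling {sum(sample_stalls):.1f} s")
```

Summing the stalls is what turns 'a few odd pauses' into a number you can compare against the total run time.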

In fact, the application experiencing the slowness was a simple Robocopy, run via a batch file, copying a few large files from the remote (NetApp-hosted) volume to the local (eight Cheetah drive) volume. In the end, all the complexity around multi-homing, iSCSI, and Exchange was mere distraction: the client did not have enough spindles to deliver the write performance which we were wanting.

2005 BlueHeat describes how we learned to drive a filer into the ground by overrunning its NVRAM cache -- nothing like solving a problem in the heat of the moment to burn understanding into the brain. Managing this filer was our first foray into mass storage. Intermittently, the thousand plus users on it, deployed across multiple storage groups backed by multiple isolated (or so we thought) trays, experienced application crashes, all together. Comparing packet traces of normal versus pathological behavior pointed toward the filer. During two separate outage windows, we replicated the problem, originally graphing many variables together, looking for correlations, before refining our understanding to the overflow of the shared NVRAM cache. Kudos to Robert McDermott for capturing, graphing, and correlating the various filer parameters, along with identifying the key variable. In a related effort, we also benchmarked ways to back up a filer, illustrating the difficulties which CIFS has in achieving high throughput, along with the effect of shoe-shining on tapes. And I sniffed on an EtherChannel in order to uncover a problem with an LACP configuration between Titan and Catalyst.

2004 WinCati Event describes how we analyzed and resolved an interaction between an EFI Fiery Server ZX (print server) and a vertical market application called WinCati.

2003 IP VTC Event describes how we analyzed and resolved an intermittent disconnect issue affecting our Polycom video-conferencing units.

How Packets Work is a document I wrote to support in-house seminars which I've orchestrated. It has evolved substantially over the last decade, as my understanding of this subject has increased. The actual seminar wants props and a lively audience, in addition to this document. Cisco offers a flash page which illustrates many of these functions, as part of a larger page describing layer 2 functions in general.

Warriors of the Net presents a high-level visualization of how packets traverse networks. The movie showcased within this site comes in various flavors -- at the high-end, it consists of a 140Mb MPG file. Ericsson, the global communications manufacturer, produced and published this movie. From a technical point of view, this production contains numerous errors. On the other hand, it offers a four dimensional visualization of how packets move, the only such visualization I've ever seen, so until I find something better, I keep pointing people to it.

In Sniffing Primer, I outline options for our department as we consider how to enhance our packet capture and analysis capabilities. More interesting for the general reader, I think, is the 'Sniffing Primer' section, in which I describe the various ways to insert a packet sniffer into a switched/routed infrastructure.

Chassis versus Stackable illustrates one aspect of the debate between these two approaches to populating IDFs. Naturally, the prices illustrated here are dated -- but conceptually, the graph illustrates that at some level of density, chassis become more cost-effective than stackables, as far as up front cost goes. The more subtle aspects to this decision come from less quantifiable sources: hardware reliability, maintenance costs, functionality.

Microsoft Network Load Balancing and Cisco Catalyst Configuration describes how to configure Cisco Catalyst switches to support this Microsoft product.

Measuring Forwarding Delay Across A Campus Network records measurements of forwarding delay across typical electronics (Catalyst 4000/4500, Catalyst 6500) deployed on our network.

Wireshark Columns illustrates the columns I display when I'm using a wide monitor.

A Teaser for OmniPeek illustrates some of this analyzer's features.


In this section, I stash links to educational material at other sites.

Cool Tools

Here I stash links to software I find useful in analyzing client/server issues.

Prepared by:
Stuart Kendrick

Last modified: 2016-November-25