Thursday, May 01, 2014

Network Troubleshooting - Sometimes It's What You DON'T See...

I spend a healthy chunk of my typical work day analyzing network packet captures.  My primary tool is Wireshark, which humbly presents itself as "The World's Most Popular Network Protocol Analyzer."  (Seriously - if you aren't using Wireshark, go download it NOW.)  Protocol analyzers are great for identifying typical "red flags" in packet data, but they're all limited to what the raw data might indicate; customer network environments are so broad (and so varied) that the network engineer--especially one "on the outside looking in" with only a small data set--relies heavily on experience and intuition.

One recent case was presented as "many failed connections," and a 6-minute packet capture soon landed in my lap.  Now, every Wireshark user has their own approach; I usually take advantage of Wireshark's display filters to get a general "feel" for the incidence of Layer 3/4 problems. With a typical capture file, I'll start with tcp.analysis.flags, which simply tells Wireshark, "hey, show me what YOU think are TCP problems." Now, as I said, none of these tools is perfect, so take these results with a grain of salt; they're only as good as the underlying data, and it's very easy to collect inaccurate or incomplete data.

After taking a look at the results of this display filter, I noticed what seemed a high number of TCP retransmissions, so I decided to see exactly which packets were being retransmitted with a different display filter, tcp.analysis.retransmission, which shows only those packets Wireshark believes to be TCP retransmissions. The resulting numbers were somewhat high, but I've seen worse.

Now, the complaint was very specific that new connections were failing; no mention was made of existing connections being interrupted or terminated. So, I went to Wireshark's Statistics->Conversations dialog, sorted on the "Packets" column to look for very short conversations, and found HUNDREDS of conversations that lasted only a few packets.
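
For readers who like to script this kind of check, here is a minimal Python sketch of the "sort by packet count and look for very short conversations" idea. It is not how Wireshark builds its Conversations dialog; it assumes the packets have already been exported as simple (source address, source port, destination address, destination port) tuples, and the function name and the 6-packet threshold are my own illustrative choices.

```python
from collections import Counter


def short_conversations(packets, max_packets=6):
    """Return {conversation: packet_count} for conversations with few packets.

    `packets` is an iterable of (src, sport, dst, dport) tuples; this input
    shape is an assumption for illustration, not a Wireshark export format.
    """
    counts = Counter()
    for src, sport, dst, dport in packets:
        # Sort the two endpoints so both directions share one key.
        key = tuple(sorted([(src, sport), (dst, dport)]))
        counts[key] += 1
    # Keep only the conversations that never got past a handful of packets.
    return {conv: n for conv, n in counts.items() if n <= max_packets}
```

A conversation that shows up here with only 3 to 6 packets barely had time to finish the handshake, which is what makes it worth a closer look.
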
Well, now, wait just a minute - the TCP handshake requires 3 packets (SYN, SYN/ACK, ACK) to establish a conversation, and I'm seeing hundreds of conversations that are only exchanging 3 to 6 packets. After checking a few suspect conversations, I found a pattern:

So, the remote endpoint starts a conversation with a SYN packet and the local endpoint responds immediately, but we see the remote endpoint retransmitting its SYN packet within 10ms. The local endpoint retransmits its SYN/ACK, but neither the original nor the retransmitted SYN/ACK seems to reach the remote endpoint, and the conversation attempt is ultimately terminated with a TCP reset (RST) packet. Back I go to Wireshark's display filters, this time to ask about a very specific type of TCP retransmission:
tcp.analysis.retransmission && tcp.flags.syn==1 && !tcp.flags.ack==1
With this display filter, I'm asking Wireshark to show me all retransmitted SYN packets; the "!tcp.flags.ack==1" eliminates SYN/ACK packets from the display. The results were startling; within a 6-minute period, more than 110 endpoints had retransmitted more than 170 SYN packets...and all of them had failed to complete the TCP handshake.
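
The idea behind that filter can be approximated in a few lines of Python. This is only a sketch: Wireshark's retransmission detection is more involved than this, and the packet dictionaries (with "src", "sport", "dst", "dport", "seq", "syn", and "ack" keys) are a hypothetical input shape I'm assuming for illustration. Here a SYN counts as retransmitted if the same endpoint pair has already sent a SYN with the same sequence number.

```python
def retransmitted_syns(packets):
    """Return the packets that repeat an earlier SYN (SYN set, ACK clear).

    Each packet is a dict with "src", "sport", "dst", "dport", "seq",
    "syn", and "ack" keys; this layout is assumed for the example.
    """
    seen = set()
    repeats = []
    for pkt in packets:
        # Skip anything that is not a bare SYN; this drops SYN/ACKs too.
        if not pkt["syn"] or pkt["ack"]:
            continue
        key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["seq"])
        if key in seen:
            repeats.append(pkt)  # same SYN seen before: a retransmission
        else:
            seen.add(key)
    return repeats


def endpoints_retransmitting(packets):
    """Return the set of source addresses that retransmitted a SYN."""
    return {pkt["src"] for pkt in retransmitted_syns(packets)}
```

Counting the returned packets and the distinct sources gives the same two kinds of numbers quoted above: how many SYNs were retransmitted, and by how many endpoints.
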

Well, if conditions are this bad to START conversations, then there must be thousands of cases in which existing connections die before completing successfully, right?  Let's go back to Wireshark's Statistics->Conversations dialog and sort on Duration to look at long-lived conversations.
Hmm...I have hundreds of conversations that last longer than 2 minutes...but I can't find one that suffers from retransmissions sufficient to terminate the conversation.
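
The same check on long-lived conversations can be sketched as a per-conversation retransmission rate. The input here is hypothetical: a list of dictionaries with "id", "duration" (seconds), "packets", and "retransmissions" keys, which you would have to assemble yourself from the capture. The 120-second cutoff simply mirrors the "longer than 2 minutes" observation above.

```python
def long_lived_retransmission_rates(conversations, min_duration=120.0):
    """Map conversation id -> fraction of packets that were retransmissions.

    Only conversations lasting longer than `min_duration` seconds are
    included. The dict keys used here are assumed for illustration.
    """
    rates = {}
    for conv in conversations:
        if conv["duration"] > min_duration and conv["packets"] > 0:
            rates[conv["id"]] = conv["retransmissions"] / conv["packets"]
    return rates
```

If established conversations were suffering the way new ones are, the rates coming out of a check like this would stand out; in this capture they didn't.
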

If I were looking at a general network congestion issue on the local network, I'd expect conversations to suffer equally--packets are packets, right?--but this is something different. That seeming conflict in the data prompted what proved to be the key question:
If I'm seeing HUNDREDS of new conversations fail the TCP handshake due to excessive retransmissions, why DON'T I see established conversations suffering excessive retransmissions as well?
Well, after a few moments' thought, it occurred to me that the only network devices that usually make specific distinctions between new and existing connections are those involved in network security. A brief conversation with the customer revealed that an intrusion prevention system (IPS) was in place and "inspecting" conversations. When we conducted a test that bypassed the IPS, the incidence of failed TCP handshakes decreased by roughly 98%; our troubleshooting attention is now properly directed.

So, the moral of this story: Pay attention to the data, but pay equal attention to what isn't there.
