Application stuck in TCP retransmit
I am running Linux kernel 3.13 (Ubuntu 14.04) on two virtual machines, each hosted on a different server running ESXi 5.1. A ZeroMQ client-server application runs between the two VMs. After running for about 10-30 minutes, this application consistently hangs because it is unable to retransmit a lost packet. When I run the same setup over Ubuntu 12.04 (Linux 3.11), the application never fails. (UPDATE: it also fails on 12.04, but takes longer.) As you can see below, `ss` (socket statistics) shows 1 packet lost, an sk_wmem_queued of 14110 (i.e. w14110) and a very high RTO (120000):
```
State  Recv-Q  Send-Q  Local Address:Port    Peer Address:Port
ESTAB  0       12350   192.168.2.122:41808   192.168.2.172:55550
       timer:(on,16sec,10) uid:1000 ino:35042 sk:ffff880035bcb100
       skmem:(r0,rb648720,t0,tb1164800,f2274,w14110,o0,bl0)
       ts sack cubic wscale:7,7 rto:120000 rtt:7.5/3 ato:40 mss:8948
       cwnd:1 ssthresh:21 send 9.5Mbps unacked:1 retrans:1/10 lost:1
       rcv_rtt:1476 rcv_space:37621
```
Since this happens so consistently, I was able to capture the TCP exchange in Wireshark. I found that the lost packet does get retransmitted and is even acknowledged by the TCP stack on the other OS (the sequence number appears in the ACK), but the sender doesn't seem to understand this ACK and continues retransmitting. The MTU is 9000 on both virtual machines and throughout the route, and the packets being sent are large. As I said earlier, this does not happen on Ubuntu 12.04 (kernel 3.11). So I diffed the TCP configuration options (seen via `sysctl -a | grep tcp`) between 14.04 and 12.04 and found the following differences. I also noticed that net.ipv4.tcp_mtu_probing=0 in both configurations. The left side is 3.11, the right side is 3.13:
```
> net.ipv4.tcp_early_retrans = 3
17c14
> net.ipv4.tcp_fastopen = 1
20d16
> net.ipv4.tcp_max_orphans = 4096
29,30c24,25
> net.ipv4.tcp_max_tw_buckets = 4096
> net.ipv4.tcp_mem = 23352 31138 46704
34a30
> net.ipv4.tcp_notsent_lowat = -1
```
My question to the networking experts on this forum: are there any other debugging tools or options I can install/enable to dig further into why this TCP retransmit failure occurs so consistently? Are there any configuration changes which might account for this weird behaviour? UPDATE (for those who may hit a similar problem later): I was able to reproduce the problem on 3.11 as well, and was then able to avoid it by lowering the MTU. A similar problem has been reported here: https://serverfault.com/questions/488893/how-do-i-prevent-tcp-connection-freezes-over-an-openvpn-network. The description given there matches what I saw:
«At some point with the Ubuntu clients, though, the remote end starts retransmitting the same TCP segment over and over (with the transmit delay increasing between each retransmission). The client sends what looks like a valid TCP ACK to each retransmission, but the remote end still continues to transmit the same TCP segment periodically.»
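For later readers: lowering the MTU can be done with `ip link set dev eth0 mtu 1500`. If you need to do it from a program instead, here is a minimal sketch using the SIOCSIFMTU ioctl; the interface name and MTU value below are just examples:

```c
/* Lower an interface's MTU via SIOCSIFMTU, equivalent to
 * "ip link set dev eth0 mtu 1500". Needs root. Sketch only. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0); /* any socket works for the ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1); /* example interface name */
    ifr.ifr_mtu = 1500;                          /* example MTU value */

    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("ioctl(SIOCSIFMTU)");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```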
Counting TCP retransmissions
I would like to know if there is a way to count the number of TCP retransmissions that occurred in a flow in Linux, either on the client side or the server side.
3 Answers
Looks like netstat -s serves my purpose.
It does, as long as you’re okay with getting the number of TCP retransmissions for the whole computer as opposed to a given TCP connection.
Also, this counter will eventually wrap around; how quickly depends on the amount of traffic and how many retransmissions occur.
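If you would rather read that whole-machine counter programmatically than scrape netstat -s output, the same numbers come from /proc/net/snmp. A minimal sketch that pulls out the RetransSegs field:

```c
/* Read the system-wide RetransSegs counter from /proc/net/snmp,
 * the same source "netstat -s" reports from. Sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) { perror("fopen"); return 1; }

    /* The file holds pairs of lines: a "Tcp:" line naming the fields,
     * then a "Tcp:" line carrying the corresponding values. */
    char header[1024], values[1024];
    long retrans = -1;

    while (fgets(header, sizeof header, f)) {
        if (strncmp(header, "Tcp:", 4) != 0)
            continue;
        if (!fgets(values, sizeof values, f))
            break;

        /* Walk both lines in parallel to find RetransSegs. */
        char *hsave, *vsave;
        char *h = strtok_r(header, " \n", &hsave);
        char *v = strtok_r(values, " \n", &vsave);
        while (h && v) {
            if (strcmp(h, "RetransSegs") == 0) {
                retrans = atol(v);
                break;
            }
            h = strtok_r(NULL, " \n", &hsave);
            v = strtok_r(NULL, " \n", &vsave);
        }
        break;
    }
    fclose(f);

    if (retrans < 0) { fprintf(stderr, "RetransSegs not found\n"); return 1; }
    printf("TCP segments retransmitted: %ld\n", retrans);
    return 0;
}
```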
You can see TCP retransmissions for a single TCP flow using Wireshark. The "Follow TCP Stream" feature lets you isolate a single TCP stream, and the tcp.analysis.retransmission display filter shows the retransmissions.
The Linux kernel exposes finer-grained retransmission counters through the proc pseudo-filesystem, in /proc/net/netstat. One of them is TCPSynRetrans:

* TCPSynRetrans: this counter is explained by kernel commit f19c29e3e391; I pasted the explanation below:

```
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
```
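Reading these programmatically works the same way as for /proc/net/snmp above; a sketch that dumps every TcpExt counter whose name contains "Retrans" (TCPSynRetrans among them):

```c
/* List the TcpExt retransmission counters from /proc/net/netstat.
 * Same header-line/value-line layout as /proc/net/snmp. Sketch only. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/net/netstat", "r");
    if (!f) { perror("fopen"); return 1; }

    char header[4096], values[4096];
    while (fgets(header, sizeof header, f)) {
        if (strncmp(header, "TcpExt:", 7) != 0)
            continue;
        if (!fgets(values, sizeof values, f))
            break;

        char *hsave, *vsave;
        /* The first token on each line is the "TcpExt:" prefix; skip it. */
        strtok_r(header, " \n", &hsave);
        strtok_r(values, " \n", &vsave);

        char *h, *v;
        while ((h = strtok_r(NULL, " \n", &hsave)) &&
               (v = strtok_r(NULL, " \n", &vsave))) {
            if (strstr(h, "Retrans"))
                printf("%s = %s\n", h, v);
        }
        break;
    }
    fclose(f);
    return 0;
}
```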
You can also adjust related settings through the same proc pseudo-filesystem, under the sys directory (/proc/sys). The sysctl utility is a handy shorthand for this:
```
$ sysctl -a | grep retrans
net.ipv4.neigh.default.retrans_time_ms = 1000
net.ipv4.neigh.docker0.retrans_time_ms = 1000
net.ipv4.neigh.enp1s0.retrans_time_ms = 1000
net.ipv4.neigh.lo.retrans_time_ms = 1000
net.ipv4.neigh.wlp6s0.retrans_time_ms = 1000
net.ipv4.tcp_early_retrans = 3
net.ipv4.tcp_retrans_collapse = 1
net.ipv6.neigh.default.retrans_time_ms = 1000
net.ipv6.neigh.docker0.retrans_time_ms = 1000
net.ipv6.neigh.enp1s0.retrans_time_ms = 1000
net.ipv6.neigh.lo.retrans_time_ms = 1000
net.ipv6.neigh.wlp6s0.retrans_time_ms = 1000
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
```
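Since these are just files under /proc/sys (net.ipv4.tcp_early_retrans is /proc/sys/net/ipv4/tcp_early_retrans), a program can change them by writing the file directly, which is exactly what `sysctl -w` does. A minimal sketch, using tcp_early_retrans from the listing above as the example and assuming root:

```c
/* Equivalent to "sysctl -w net.ipv4.tcp_early_retrans=3".
 * Needs root. Sketch only. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_early_retrans", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "3\n");
    return fclose(f) == 0 ? 0 : 1;
}
```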
Application control of TCP retransmission on Linux
In my application, the server sends "heartbeats" to the client every now and then (every 30 seconds by default). A heartbeat is just a newline character sent as a response chunk. It is meant to keep the line busy so that we notice when the connection is lost.
There's no problem when the client shuts down correctly. But when it's shut down forcibly (the client machine loses power, for example), no TCP reset is sent. In this case the server sends a heartbeat, which the client doesn't ACK, and the server then keeps retransmitting the packet for roughly 15 minutes before giving up and reporting the failure to the application layer (our HTTP server). And 15 minutes is too long a wait in my case.
I can control the retransmission time by writing to the following files in /proc/sys/net/ipv4/:
```
tcp_retries1 - INTEGER
    This value influences the time, after which TCP decides, that
    something is wrong due to unacknowledged RTO retransmissions,
    and reports this suspicion to the network layer.
    See tcp_retries2 for more details.

    RFC 1122 recommends at least 3 retransmissions, which is the
    default.

tcp_retries2 - INTEGER
    This value influences the timeout of an alive TCP connection,
    when RTO retransmissions remain unacknowledged.
    Given a value of N, a hypothetical TCP connection following
    exponential backoff with an initial RTO of TCP_RTO_MIN would
    retransmit N times before killing the connection at the (N+1)th
    RTO.

    The default value of 15 yields a hypothetical timeout of 924.6
    seconds and is a lower bound for the effective timeout.
    TCP will effectively time out at the first RTO which exceeds the
    hypothetical timeout.

    RFC 1122 recommends at least 100 seconds for the timeout, which
    corresponds to a value of at least 8.
```
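If I read the kernel's RTO constants correctly (TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s), the 924.6-second figure is just the sum of the backed-off RTO intervals: ten doubling steps 0.2 + 0.4 + ... + 102.4 = 204.6 s, then six more RTOs capped at 120 s, so 204.6 + 6 × 120 = 924.6 s, with the connection killed when the 16th RTO fires.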
The default value of tcp_retries2 is indeed 15, and my experience of 15 minutes (900 seconds) of retransmission is in line with the kernel documentation quoted above.
If I change the value of tcp_retries2 to 5, for example, the connection dies much quicker. But setting it like this affects all connections in the system, and I'd really like to set it for this one long-polling connection only.
RFC 1122 specifies how TCP must handle excessive retransmissions:

```
4.2.3.5  TCP Connection Failures

    Excessive retransmission of the same segment by TCP indicates
    some failure of the remote host or the Internet path.  This
    failure may be of short or long duration.  The following
    procedure MUST be used to handle excessive retransmissions of
    data segments [IP:11]:

    (a)  There are two thresholds R1 and R2 measuring the amount of
         retransmission that has occurred for the same segment.
         R1 and R2 might be measured in time units or as a count of
         retransmissions.

    (b)  When the number of transmissions of the same segment
         reaches or exceeds threshold R1, pass negative advice (see
         Section 3.3.1.4) to the IP layer, to trigger dead-gateway
         diagnosis.

    (c)  When the number of transmissions of the same segment
         reaches a threshold R2 greater than R1, close the
         connection.

    (d)  An application MUST be able to set the value for R2 for a
         particular connection.  For example, an interactive
         application might set R2 to "infinity," giving the user
         control over when to disconnect.

    (e)  TCP SHOULD inform the application of the delivery problem
         (unless such information has been disabled by the
         application; see Section 4.2.4.1), when R1 is reached and
         before R2.  This will allow a remote login (User Telnet)
         application program to inform the user, for example.
```
It seems to me that tcp_retries1 and tcp_retries2 in Linux correspond to R1 and R2 in the RFC. The RFC clearly states (in item (d)) that a conforming implementation MUST allow setting the value of R2 for a particular connection, but I have found no way to do it using setsockopt(), ioctl() or the like.
Another option would be to get a notification when R1 is exceeded (item (e)). This is not as good as setting R2, though, as I think R1 is hit pretty soon (within a few seconds), and the value of R1 cannot be set per connection, or at least the RFC doesn't require it.
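One per-connection mechanism worth knowing about here: Linux 2.6.37 added the TCP_USER_TIMEOUT socket option (in the spirit of RFC 5482), which caps how long transmitted data may remain unacknowledged before the kernel closes the connection. That is effectively an R2 expressed in time units rather than as a retransmission count. A minimal sketch (if your headers predate the option, its value is 18):

```c
/* Cap how long unacknowledged data may sit in the retransmit queue
 * before the kernel aborts the connection (Linux 2.6.37+). */
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18   /* from linux/tcp.h on older systems */
#endif

int set_conn_timeout(int sockfd, unsigned int timeout_ms)
{
    /* e.g. timeout_ms = 30000: report the failure to the application
     * roughly 30 seconds after the peer stops acknowledging, instead
     * of waiting out all of tcp_retries2. */
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &timeout_ms, sizeof timeout_ms) < 0) {
        perror("setsockopt(TCP_USER_TIMEOUT)");
        return -1;
    }
    return 0;
}
```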