connect() stalls after 4.9 -> 4.10 upgrade

From: Lutz Vieweg
Date: Mon Feb 27 2017 - 09:19:42 EST


Hi all,

the following regression a colleague and I experienced on two different
machines after upgrading from linux-4.9 to linux-4.10 is so obvious and
so easy to reproduce that I am surprised we could not find any reports
of it on the Internet:

After upgrading a machine running the latest CentOS from mainline kernel
linux-4.9 to linux-4.10, attempts to connect() via IPv4 to localhost
(and seemingly also to other hosts) fail in about half of the cases, leaving
the connecting process stalled until the timeout.

Reproduction:

> ncat -k -l 19999 &
> C=1 ; while true ; do echo -n "$C " ; echo ping | ncat localhost 19999 ; C=`expr $C + 1` ; sleep 1 ; done

Using linux-4.10, the output looks like this:

> 1 ping
> 2 Ncat: Connection timed out.
> 3 ping
> 4 Ncat: Connection timed out.
> 5 ping
> 6 ping
> 7 ping
> 8 Ncat: Connection timed out.
> 9 ping
> 10 ping
> 11 Ncat: Connection timed out.
> 12 ping
> 13 Ncat: Connection timed out.
> 14 ping
> 15 Ncat: Connection timed out.
> 16 ping
> 17 Ncat: Connection timed out.
> 18 ping
> 19 ping
> 20 Ncat: Connection timed out.
> 21 ping
> 22 ping
> 23 Ncat: Connection timed out.
> 24 ping
> 25 Ncat: Connection timed out.
> 26 Ncat: Connection timed out.
> 27 ping
> 28 Ncat: Connection timed out.
> 29 ping


Using linux-4.9, the output looks like this:

> 1 ping
> 2 ping
> 3 ping
> 4 ping
> 5 ping
> 6 ping
> 7 ping
> 8 ping
> 9 ping
> 10 ping
> 11 ping
> 12 ping
> 13 ping
> 14 ping
> 15 ping
> 16 ping
> 17 ping
> 18 ping
> 19 ping
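
To put a number on "about half of the cases", the loop output can be
tallied. The following is only a minimal sketch, not part of the original
reproduction: the function name tally_connects is hypothetical, and it
assumes the loop's stdout and stderr were captured together (2>&1), one
attempt per line, in the format shown above.

```shell
# Hypothetical helper: summarize the output of the reproduction loop.
# Reads lines such as "1 ping" or "2 Ncat: Connection timed out." on stdin
# and prints the number of successful and timed-out connection attempts.
tally_connects() {
    awk '
        /Connection timed out/ { failed++ }   # ncat gave up on connect()
        / ping$/               { ok++ }       # the payload was delivered
        END {
            total = ok + failed
            pct = total ? failed * 100 / total : 0
            printf "ok=%d failed=%d total=%d failed_pct=%.1f\n", ok, failed, total, pct
        }
    '
}
```

Assuming the output was saved to a file (connect.log is a placeholder name),
"tally_connects < connect.log" prints a one-line summary. For the 29
linux-4.10 attempts listed above, 12 timed out and 17 succeeded.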

The two machines we observed this symptom on were running
different distributions (CentOS 7.3 and Ubuntu) and used
completely different hardware.

Has nobody else experienced the same symptom?

(I also created https://bugzilla.kernel.org/show_bug.cgi?id=194723 on this topic.)

Regards,

Lutz Vieweg