Strange PF_NETLINK NETLINK_ROUTE stall: netlink: Fix autobind race condition that leads to zero port ID

From: Philipp Hahn
Date: Fri Nov 27 2015 - 10:32:56 EST


Hello,

I observed a strange stall in a multi-threaded process (originally
OpenLDAP slapd), which stalls during a call to getaddrinfo() after
upgrading the kernel from 3.16.x to 4.1.12:

> #0 0x00007f8e6d97b87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #1 0x00007f8e6d99bfbc in make_request (fd=12, pid=2871, seen_ipv4=<optimized out>, seen_ipv6=<optimized out>, in6ai=<optimized out>, in6ailen=<optimized out>) at ../sysdeps/unix/sysv/linux/check_pf.c:119
> #2 0x00007f8e6d99c42a in __check_pf (seen_ipv4=0x7f8e6a0969ff, seen_ipv6=0x7f8e6a0969fe, in6ai=0x7f8e6a0969e0, in6ailen=0x7f8e6a0969d8) at ../sysdeps/unix/sysv/linux/check_pf.c:272
> #3 0x00007f8e6d959007 in *__GI_getaddrinfo (name=0x7f8e6a096a80 "0.0.0.0", service=0x400d04 "7636", hints=0x7f8e6a096a50, pai=0x7f8e6a096a48) at ../sysdeps/posix/getaddrinfo.c:2389

(The glibc tries to get the routing table to see via netlink, if it
should support IPv4 and/or IPv6. Neither
<https://sourceware.org/bugzilla/show_bug.cgi?id=12926> nor
<https://sourceware.org/git/?p=glibc.git;a=commit;h=636064eb4c03397c86aa26e489e68f952bd5e53f>
fixed the problem.)

I've been able to reproduce it by writing the attached program, which
calls getaddrinfo() from 10 thread 10^6 times. On my 8-core Intel I see
several threads hanging in the above call after some seconds - you don't
have to wait for the process to terminate):

> gdb --batch --ex 'set pagination off' --ex 'thread apply all bt' -p `pidof a.out` | grep rec
...
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82
> #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82

Using "git bisect start v4.1.11 v.4.10" I've been able to identify
4e27762417669cb459971635be550eb7b5598286 (v4.1.9-26-g4e27762), which is
a back-port of 1f770c0a09da855a2b51af6d19de97fb955eca85
(v4.2-11228-g1f770c0). I've attached the log.

> # git log -1 4e27762417669cb459971635be550eb7b5598286
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Date: Fri Sep 18 19:16:50 2015 +0800
>
> netlink: Fix autobind race condition that leads to zero port ID
>
> [ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]
>
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads try to autobind the same socket
> one of them may end up with a zero port ID. This led to kernel
> deadlocks that were observed by multiple people.
>
> This patch reverts that commit and instead fixes it by introducing
> a separte rhash_portid variable so that the real portid is only set
> after the socket has been successfully hashed.
>
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

I verified 4e27762417669cb459971635be550eb7b5598286~1 does not show the
problem.

v4.1.13 still has the problem.
> # git log --oneline v4.1..v4.1.13 -- net/netlink/af_netlink.c
> e962218 netlink: Trim skb to alloc size to avoid MSG_TRUNC
> d486236 netlink: Replace rhash_portid with bound
> 4e27762 netlink: Fix autobind race condition that leads to zero port ID
> 65d48c6 netlink, mmap: transform mmap skb into full skb on taps
> d397617 netlink: make sure -EBUSY won't escape from netlink_insert
> b265c30 netlink: don't hold mutex in rcu callback when releasing mmapd ring

v4.3 does not show the problem.
> # git log --oneline v4.2..v4.3 -- net/netlink/af_netlink.c
> 47191d6 netlink: fix locking around NETLINK_LIST_MEMBERSHIPS
> db65a3a netlink: Trim skb to alloc size to avoid MSG_TRUNC
> da314c9 netlink: Replace rhash_portid with bound
> 1f770c0 netlink: Fix autobind race condition that leads to zero port ID
> 1853c94 netlink, mmap: transform mmap skb into full skb on taps
> 6bb0fef netlink, mmap: fix edge-case leakages in nf queue zero-copy
> a66e365 netlink, mmap: don't walk rx ring on poll if receive queue non-empty
> 0ef7077 netlink: rx mmap: fix POLLIN condition
> 7084a31 netlink: mmap: fix lookup frame position
> 0a6a3a2 netlink: add NETLINK_CAP_ACK socket option

v4.2 does not show the problem (as expected, as the patch isn't in 4.2.0
> # git log --oneline v4.2..v4.2.6 -- net/netlink/af_netlink.c
> 326ce2c netlink: Trim skb to alloc size to avoid MSG_TRUNC
> e2a3131 netlink: Replace rhash_portid with bound
> 6e32e73 netlink: Fix autobind race condition that leads to zero port ID
> 62f43b5 netlink, mmap: transform mmap skb into full skb on taps ring

I no longer have an idea where to look next, as something outside of
af_netlink.c seems to have fixed the issue for at least 4.3. Any idea?

Kernel: 4.1.12 (Debian)
GLIBC: 2.13-38+deb7u8
CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz

Thank you and a (hopefully) happy weekend.
Philipp
--
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@xxxxxxxxxxxxx

http://www.univention.de/
GeschÃftsfÃhrer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876
/*
* <https://forge.univention.org/bugzilla/show_bug.cgi?id=40059>
*/
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <grp.h>
#include <string.h>
#include <limits.h>
#include <stdarg.h>
#include <ctype.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define ARRAY_SIZE(a) ((sizeof (a)) / sizeof (*(a)))
#define BUFLEN 4096

// cyrus-sasl
int check(void) {
const char *addr = "0.0.0.0;7636";
char hbuf[NI_MAXHOST];
struct addrinfo hints, *ai = NULL;
int i, j;

/* Parse the address */
for (i = 0; addr[i] != '\0' && addr[i] != ';'; i++) {
if (i >= NI_MAXHOST)
return 3;
hbuf[i] = addr[i];
}
hbuf[i] = '\0';

if (addr[i] == ';')
i++;
/* XXX: Do we need this check? */
for (j = i; addr[j] != '\0'; j++)
if (!isdigit((int)(addr[j])))
return 2;

memset(&hints, 0, sizeof(hints));
hints.ai_family = PF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE | AI_NUMERICHOST;
if (getaddrinfo(hbuf, &addr[i], &hints, &ai) != 0)
return 1;

freeaddrinfo(ai);

return 0;
}

static pthread_t thread[10];

static void *run(void *data) {
unsigned int round, j;

unsigned int self = (unsigned int)(unsigned long)data;
pthread_t tid = pthread_self();

printf("[%d] POSIX thread ID is %u\n", self, (unsigned int)tid);

for (round = 0; round < 1000; round++) {
for (j = 0; j < 1000; j++) {
int i;
i = check();
if (i)
break;
}
printf("%d\n", round);

unsigned int usecs = (round * (unsigned int)tid) % 1000000;
usleep(usecs);
if (round % 10 == 0)
putchar('0' + (self % 10));
}

printf("[%d] end\n", self);
pthread_exit(NULL);
}

int main(int argc, char *argv[], char *envp[]) {
pid_t pid = getpid();
printf("pid=%d\n", pid);

unsigned int t;
for (t = 0; t < ARRAY_SIZE(thread); t++)
pthread_create(&thread[t], NULL, run, (void *)(long)t);

int ret = 0;
void *value_ptr;
for (t = 0; t < ARRAY_SIZE(thread); t++)
ret |= pthread_join(thread[t], &value_ptr);

return ret;
}
/* :!gcc -O0 -g -Wall -lpthread % */
git bisect start
# bad: [205a8514e63a221c3a5071f27521013444e43e5f] Linux 4.1.11
git bisect bad 205a8514e63a221c3a5071f27521013444e43e5f
# good: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
git bisect good b953c0d234bc72e8489d3bf51a276c5c4ec85345
# good: [4b6bf170e83e666974ce00d7251dd0d4b5ca5439] ocfs2: fix shift left overflow
git bisect good 4b6bf170e83e666974ce00d7251dd0d4b5ca5439
# good: [0796f55ba4db268f3a6a6c6bb9e481ce9eaf5d48] ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
git bisect good 0796f55ba4db268f3a6a6c6bb9e481ce9eaf5d48
# bad: [dc7a3d707c9784f26048d6f059e7467a8a92e04b] spi: Fix documentation of spi_alloc_master()
git bisect bad dc7a3d707c9784f26048d6f059e7467a8a92e04b
# good: [8bb9225a7bd3a07aadd58b19a633f1f7a8884d01] net: eth: altera: fix napi poll_list corruption
git bisect good 8bb9225a7bd3a07aadd58b19a633f1f7a8884d01
# bad: [b6ac3aee1c73d5b107ae6ec6e3715b39db782781] iser-target: Put the reference on commands waiting for unsol data
git bisect bad b6ac3aee1c73d5b107ae6ec6e3715b39db782781
# bad: [4e27762417669cb459971635be550eb7b5598286] netlink: Fix autobind race condition that leads to zero port ID
git bisect bad 4e27762417669cb459971635be550eb7b5598286
# good: [cf9cf6bc253c5d83a8cb593d4ed7042bb4f04e7c] macvtap: fix TUNSETSNDBUF values > 64k
git bisect good cf9cf6bc253c5d83a8cb593d4ed7042bb4f04e7c
# good: [fce134644303a4ebe0892edc9af7d46e0ebda9ab] fib_rules: fix fib rule dumps across multiple skbs
git bisect good fce134644303a4ebe0892edc9af7d46e0ebda9ab
# good: [282117acdfac7b5d621b18f002a9751d7f583daf] net: phy: fixed_phy: handle link-down case
git bisect good 282117acdfac7b5d621b18f002a9751d7f583daf
# good: [d600176461e85a0bf3b963587e4ca92d494a7a47] mvneta: use inband status only when explicitly enabled
git bisect good d600176461e85a0bf3b963587e4ca92d494a7a47