Re: [BUG][4.5-rc5] rcu_sched self-detected stall on CPU - directly after network goes online.

From: Ian Kumlien
Date: Wed Feb 24 2016 - 17:54:24 EST


On 22 February 2016 at 01:38, Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> Hi,
>
> When I tried to upgrade my soon-to-be firewall to 4.5-rc5 to do some
> testing, it deadlocked almost instantly.

After bisecting, the offending commit seems to be:
b16c29191dc89bd877af99a7b04ce4866728a3e0

It looks like some basic sanity checking went missing...

The original patch does:
diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index 5d010f2..94837d2 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -307,12 +307,12 @@ static void ctnl_untimeout(struct net *net, struct ctnl_timeout *timeout)

 	local_bh_disable();
 	for (i = 0; i < net->ct.htable_size; i++) {
-		spin_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
+		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
 		if (i < net->ct.htable_size) {
 			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
 				untimeout(h, timeout);
 		}
-		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
+		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
 	}
 	local_bh_enable();
 }
---

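Which looks like a mistake: the closing spin_unlock() was replaced with a
second nf_conntrack_lock(), so the lock is taken twice and never released.
The fix should be: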
diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index 94837d2..2671b9d 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -312,7 +312,7 @@ static void ctnl_untimeout(struct net *net, struct ctnl_timeout *timeout)
 			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
 				untimeout(h, timeout);
 		}
-		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
+		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
 	}
 	local_bh_enable();
 }
---

And it fixes my issue! ;)

> In the photo, I started writing "root" and it keeps repeating it, like
> it's in a while loop.
>
> https://goo.gl/photos/yGhNSogJjeb2VJyu5
>
> Trying to get better information (as in any at all), I enabled quite a
> few debugging options that could have any bearing on it and ended up with:
> https://goo.gl/photos/NnQER2WXXJ5ZWPR67
>
> The interesting part is that in this case the machine was booted into
> single user mode and did not crash.
>
> It seems like it gets into trouble when the bridges and the network
> interfaces are enabled, as in just about a second or two after boot.

[--8<--]
From caff3fec1641ba3e207ff705b68eba62dec3bef9 Mon Sep 17 00:00:00 2001
From: Ian Kumlien <ian.kumlien@xxxxxxxxx>
Date: Wed, 24 Feb 2016 23:40:57 +0100
Subject: [PATCH] netfilter: nf_conntrack: lock error

A lock error was introduced during the lock cleanup,
let's undo that. =)

Signed-off-by: Ian Kumlien <ian.kumlien@xxxxxxxxx>
---
net/netfilter/nfnetlink_cttimeout.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index 94837d2..2671b9d 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -312,7 +312,7 @@ static void ctnl_untimeout(struct net *net, struct ctnl_timeout *timeout)
 			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
 				untimeout(h, timeout);
 		}
-		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
+		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
 	}
 	local_bh_enable();
 }
--
2.7.2