Re: ARM board lockups/hangs triggered by locks and mutexes

From: Rafał Miłecki
Date: Wed Aug 02 2023 - 03:02:23 EST


On 2.08.2023 00:25, Florian Fainelli wrote:
Hi Rafal,

On 8/1/23 15:10, Rafał Miłecki wrote:
Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable 5.4 kernel and started experiencing
stability issues myself! After some uptime (usually 0 to 20 minutes of
close to zero activity) the serial console hangs. I can't type anything
and I stop getting any messages. I have to wait about a minute for the
watchdog to kick in and reboot the device.

#####

I took that chance and decided to track down the regression.

The Linux 5.4 stable branch was stable up to release v5.4.197.
Starting with v5.4.198 I began experiencing those stability issues. I
bisected it down to commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e

With the above commit reverted I was able to use the stable 5.4 branch
up to release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the <linux/refcount.h> header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check the above commits individually.)

Reverting the above locking/refcount commits worked fine for a few
releases: up to v5.4.219. Starting with v5.4.220 I got hangs again. I
bisected that down to commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So in total I
need 10+1 reverts on the 5.4 branch to get a stable kernel.

#####

I'm clueless at this point. Is it possible that the kernel has some
locking bug that I can only hit on this specific SoC? BCM53573s have a
single ARM Cortex-A7 CPU running at 900 MHz. The only unusual thing
about this hw I can think of is a slow arch timer running at 36.8 kHz.
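
For a sense of scale, the timer rate quoted above can be converted into
a tick period. This is only a rough sketch: the function name is
hypothetical, and the 24 MHz comparison value is an assumption chosen
for illustration (a rate commonly seen on other ARM SoCs), not a figure
from this report.

```python
def tick_period_us(freq_hz: float) -> float:
    """Return the duration of one timer tick in microseconds."""
    return 1_000_000 / freq_hz


BCM53573_TIMER_HZ = 36_800          # 36.8 kHz, as stated in the report
COMPARISON_TIMER_HZ = 24_000_000    # assumed comparison value, not from the report

slow = tick_period_us(BCM53573_TIMER_HZ)
fast = tick_period_us(COMPARISON_TIMER_HZ)

print(f"36.8 kHz timer: {slow:.2f} us per tick")
print(f"24 MHz timer:   {fast:.4f} us per tick")
print(f"ratio:          {slow / fast:.0f}x coarser")
```

At 36.8 kHz one tick lasts roughly 27 us, so this timer cannot
distinguish intervals shorter than that. This does not by itself explain
the hangs; it only quantifies how coarse the timer is.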

From the look of it, it seems like the CPU might have bugs with atomics?

Your log indicates that your Cortex-A7 is r0p5, which is described as
susceptible to ARM_ERRATA_814220. Do you have it enabled by any chance?
If not, can you enable it and see if it makes any difference?

I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.