ARM board lockups/hangs triggered by locks and mutexes

From: Rafał Miłecki
Date: Tue Aug 01 2023 - 18:10:47 EST


Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

#####

I took that great chance and decided to track the regression.

Linux 5.4 stable branch worked stable up to the release v5.4.197.
Starting with v5.4.198 I started experiencing those stability issues. I
bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e

With above commit reverted I was able to use stable 5.4 branch up to the
release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the <linux/refcount.h> header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check above commits individually).

Reverting above locking/refcount commits worked fine for few releases:
up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.

#####

I'm clueless at this point. Is that possible kernel has some locking bug
I can hit only using this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36,8 kHz.

I tried compiling kernel with:
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
but it didn't change or report anything.

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make locksup/hangs go away. I tried for few hours.

Sadly I don't have access to JTAG or any low level debugging interface.

Does looking at commits I reported above give anyone a hint on what may
be going on maybe?

--
Rafał
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 5.4.238 (ubuntu@nat) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r15234+1-d89a7f0120)) #0 SMP Fri Jul 14 12:56:51 2023
[ 0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[ 0.000000] CPU: div instructions available: patching division code
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Tenda AC9
[ 0.000000] earlycon: ns16550a0 at MMIO 0x18000300 (options '115200n8')
[ 0.000000] printk: bootconsole [ns16550a0] enabled
[ 0.000000] Memory policy: Data cache writealloc
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00000c06) during first unmask, this is most likely caused by a firmware/bootloader bug.
[ 0.000000] percpu: Embedded 14 pages/cpu s26828 r8192 d22324 u57344
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 32480
[ 0.000000] Kernel command line: console=ttyS0,115200 earlycon
[ 0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 121068K/131072K available (5309K kernel code, 151K rwdata, 1260K rodata, 1024K init, 281K bss, 10004K reserved, 0K cma-reserved, 0K highmem)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] rcu: Hierarchical RCU implementation.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[ 0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[ 0.000000] arch_timer: cp15 timer(s) running at 0.03MHz (virt).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x10eb00226, max_idle_ns: 56421785894076 ns
[ 0.000000] sched_clock: 56 bits at 35kHz, resolution 27918ns, wraps every 70368744165810ns
[ 0.008599] Ignoring delay timer 0xc0926268, which has insufficient resolution of 27918ns
[ 0.017281] Calibrating delay loop... 1790.77 BogoMIPS (lpj=8953856)
[ 0.083784] pid_max: default: 32768 minimum: 301
[ 0.088754] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.096320] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.105421] CPU: Testing write buffer coherency: ok
[ 0.111061] /cpus/cpu@0 missing clock-frequency property
[ 0.116645] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[ 0.123820] Setting up static identity map for 0x100000 - 0x10003c
[ 0.130493] rcu: Hierarchical SRCU implementation.
[ 0.136160] smp: Bringing up secondary CPUs ...
[ 0.140906] smp: Brought up 1 node, 1 CPU
[ 0.145038] SMP: Total of 1 processors activated (1790.77 BogoMIPS).
[ 0.151627] CPU: All CPU(s) started in SVC mode.
[ 0.158886] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.169132] futex hash table entries: 256 (order: 2, 16384 bytes, linear)
[ 0.176335] pinctrl core: initialized pinctrl subsystem
[ 0.182952] NET: Registered protocol family 16
[ 0.189066] DMA: preallocated 256 KiB pool for atomic coherent allocations
[ 0.235328] clocksource: Switched to clocksource arch_sys_counter
[ 0.242950] thermal_sys: Registered thermal governor 'step_wise'
[ 0.243341] NET: Registered protocol family 2
[ 0.254369] IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear)
[ 0.262549] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 6144 bytes, linear)
[ 0.271288] TCP established hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.279217] TCP bind hash table entries: 1024 (order: 1, 8192 bytes, linear)
[ 0.286531] TCP: Hash tables configured (established 1024 bind 1024)
[ 0.293232] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[ 0.300044] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[ 0.307638] NET: Registered protocol family 1
[ 0.312161] PCI: CLS 0 bytes, default 64
[ 0.322686] workingset: timestamp_bits=14 max_order=15 bucket_order=1
[ 0.335808] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 0.341839] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[ 0.371572] Serial: 8250/16550 driver, 2 ports, IRQ sharing disabled
[ 0.386118] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
[ 0.392316] bcma-host-soc 18000000.axi: bus0: Found chip with id 53573, rev 0x02 and package 0x01
[ 0.401725] bcma-host-soc 18000000.axi: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x36, class 0x0)
[ 0.412753] bcma-host-soc 18000000.axi: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x38, class 0x0)
[ 0.423809] bcma-host-soc 18000000.axi: bus0: Core 2 found: PCIe Gen 2 (manuf 0x4BF, id 0x501, rev 0x05, class 0x0)
[ 0.434725] bcma-host-soc 18000000.axi: bus0: Core 3 found: ARM CA7 (manuf 0x4BF, id 0x847, rev 0x00, class 0x0)
[ 0.445418] bcma-host-soc 18000000.axi: bus0: Core 4 found: USB 2.0 Host (manuf 0x4BF, id 0x819, rev 0x05, class 0x0)
[ 0.456558] bcma-host-soc 18000000.axi: bus0: Core 5 found: GBit MAC (manuf 0x4BF, id 0x82D, rev 0x08, class 0x0)
[ 0.467334] bcma-host-soc 18000000.axi: bus0: Core 6 found: I2S (manuf 0x4BF, id 0x834, rev 0x06, class 0x0)
[ 0.477636] bcma-host-soc 18000000.axi: bus0: Core 7 found: CNDS DDR2/3 memory controller (manuf 0x4BF, id 0x846, rev 0x00, class 0x0)
[ 0.490228] bcma-host-soc 18000000.axi: bus0: Core 8 found: NAND flash controller (manuf 0x4BF, id 0x509, rev 0x01, class 0x0)
[ 0.502177] bcma-host-soc 18000000.axi: bus0: Core 9 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x38, class 0x0)
[ 0.513261] bcma-host-soc 18000000.axi: bus0: Core 10 found: GBit MAC (manuf 0x4BF, id 0x82D, rev 0x08, class 0x0)
[ 0.524121] bcma-host-soc 18000000.axi: bus0: Core 11 found: I2S (manuf 0x4BF, id 0x834, rev 0x06, class 0x0)
[ 0.534451] bcma-host-soc 18000000.axi: bus0: Core 12 found: GCI (manuf 0x4BF, id 0x840, rev 0x08, class 0x0)
[ 0.544837] bcma-host-soc 18000000.axi: bus0: Core 13 found: PMU (manuf 0x4BF, id 0x827, rev 0x1C, class 0x0)
[ 0.563487] bcma-host-soc 18000000.axi: bus0: Found M25P64 serial flash (size: 8192KiB, blocksize: 0x10000, blocks: 128)
[ 0.575771] printk: console [ttyS0] disabled
[ 0.580266] 18000300.serial: ttyS0 at MMIO 0x18000300 (irq = 20, base_baud = 2500000) is a 16550A
[ 0.589508] printk: console [ttyS0] enabled
[ 0.589508] printk: console [ttyS0] enabled
[ 0.598079] printk: bootconsole [ns16550a0] disabled
[ 0.598079] printk: bootconsole [ns16550a0] disabled
[ 0.629152] 4 bcm47xxpart partitions found on MTD device bcm47xxsflash
[ 0.635797] Creating 4 MTD partitions on "bcm47xxsflash":
[ 0.641297] 0x000000000000-0x000000040000 : "boot"
[ 0.647244] 0x000000040000-0x0000007b0000 : "firmware"
[ 0.653442] 2 trx partitions found on MTD device firmware
[ 0.658970] Creating 2 MTD partitions on "firmware":
[ 0.664023] 0x00000000001c-0x000000255c00 : "linux"
[ 0.670054] 0x000000255c00-0x000000770000 : "rootfs"
[ 0.676140] mtd: device 3 (rootfs) set to be root filesystem
[ 0.681919] 1 squashfs-split partitions found on MTD device rootfs
[ 0.688257] 0x000000510000-0x000000770000 : "rootfs_data"
[ 0.694762] 0x0000007b0000-0x0000007f0000 : "boot"
[ 0.700737] 0x0000007f0000-0x000000800000 : "nvram"
[ 0.854989] pcie_iproc_bcma bcma0:2: link: UP
[ 0.859623] pcie_iproc_bcma bcma0:2: PCI host bridge to bus 0000:00
[ 0.866045] pci_bus 0000:00: root bus resource [mem 0x10000000-0x17ffffff]
[ 0.873024] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[ 0.881121] pci 0000:00:00.0: [14e4:d145] type 01 class 0x060400
[ 0.887263] pci_bus 0000:00: 2-byte config write to 0000:00:00.0 offset 0x4 may corrupt adjacent RW1C bits
[ 0.897146] pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
[ 0.903986] PCI: bus0: Fast back to back transfers disabled
[ 0.909877] pci 0000:01:00.0: [14e4:a8db] type 00 class 0x028000
[ 0.916047] pci_bus 0000:01: 2-byte config write to 0000:01:00.0 offset 0x4 may corrupt adjacent RW1C bits
[ 0.925903] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[ 0.932827] pci 0000:01:00.0: enabling Extended Tags
[ 0.937992] pci 0000:01:00.0: supports D1 D2
[ 0.942961] PCI: bus1: Fast back to back transfers disabled
[ 0.948685] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 01
[ 0.955469] pci 0000:00:00.0: BAR 8: assigned [mem 0x10000000-0x100fffff]
[ 0.962365] pci 0000:01:00.0: BAR 0: assigned [mem 0x10000000-0x10007fff 64bit]
[ 0.969819] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 0.974845] pci 0000:00:00.0: bridge window [mem 0x10000000-0x100fffff]
[ 0.982271] bgmac_bcma bcma0:5: Found PHY addr: 30 (NOREGS)
[ 0.989111] Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
[ 0.996565] bgmac_bcma bcma0:5: Support for Roboswitch not implemented
[ 1.004271] bgmac_bcma bcma0:5: Timeout waiting for reg 0x1E0
[ 1.014601] bgmac_bcma bcma0:10: Found PHY addr: 30 (NOREGS)
[ 1.020464] bgmac_bcma bcma0:10: Support for Roboswitch not implemented
[ 1.028281] bgmac_bcma bcma0:10: Timeout waiting for reg 0x1E0
[ 1.039086] bcm47xx-wdt bcm47xx-wdt.0: BCM47xx Watchdog Timer enabled (30 seconds)
[ 1.046931] bcma-host-soc 18000000.axi: bus0: Bus registered
[ 1.052962] pci 0000:00:00.0: enabling device (0140 -> 0142)
[ 1.058797] bcma-pci-bridge 0000:01:00.0: enabling device (0140 -> 0142)
[ 1.065693] bcma-pci-bridge 0000:01:00.0: bus1: Found chip with id 43217, rev 0x00 and package 0x08
[ 1.075018] bcma-pci-bridge 0000:01:00.0: bus1: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x27, class 0x0)
[ 1.085878] bcma-pci-bridge 0000:01:00.0: bus1: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x1E, class 0x0)
[ 1.096850] bcma-pci-bridge 0000:01:00.0: bus1: Core 2 found: PCIe (manuf 0x4BF, id 0x820, rev 0x14, class 0x0)
[ 1.128929] can not parse nvram name 0:ag2(null) with value 0xff got -34
[ 1.135797] can not parse nvram name 0:ag3(null) with value 0xff got -34
[ 1.215450] bcma-pci-bridge 0000:01:00.0: bus1: Bus registered
[ 1.222457] NET: Registered protocol family 10
[ 1.231392] Segment Routing with IPv6
[ 1.235356] NET: Registered protocol family 17
[ 1.239935] 8021q: 802.1Q VLAN Support v1.8
[ 1.244290] Registering SWP/SWPB emulation handler
[ 1.326344] b53_common: found switch: BCM53125, rev 4
[ 1.336841] VFS: Mounted root (squashfs filesystem) readonly on device 31:3.
[ 1.353034] Freeing unused kernel memory: 1024K
[ 1.365207] Run /sbin/init as init process
[ 1.751214] init: Console is alive
[ 1.755178] init: - watchdog -
[ 2.444636] kmodloader: loading kernel modules from /etc/modules-boot.d/*
[ 2.678011] usbcore: registered new interface driver usbfs
[ 2.683650] usbcore: registered new interface driver hub
[ 2.689262] usbcore: registered new device driver usb
[ 2.702049] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 2.713356] fsl-ehci: Freescale EHCI Host controller driver
[ 2.722011] ehci-platform: EHCI generic platform driver
[ 2.727762] ehci-platform 18004000.usb: EHCI Host Controller
[ 2.733541] ehci-platform 18004000.usb: new USB bus registered, assigned bus number 1
[ 2.741722] ehci-platform 18004000.usb: irq 23, io mem 0x18004000
[ 2.775057] ehci-platform 18004000.usb: USB 2.0 started, EHCI 1.00
[ 2.782567] hub 1-0:1.0: USB hub found
[ 2.787174] hub 1-0:1.0: 2 ports detected
[ 2.801468] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 2.810402] ohci-platform: OHCI generic platform driver
[ 2.816125] ohci-platform 1800d000.usb: Generic Platform OHCI controller
[ 2.822966] ohci-platform 1800d000.usb: new USB bus registered, assigned bus number 2
[ 2.831146] ohci-platform 1800d000.usb: irq 23, io mem 0x1800d000
[ 2.910156] hub 2-0:1.0: USB hub found
[ 2.914512] hub 2-0:1.0: 2 ports detected
[ 2.927550] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
[ 2.955636] init: - preinit -
[ 3.710592] random: jshn: uninitialized urandom read (4 bytes read)
[ 3.766932] random: jshn: uninitialized urandom read (4 bytes read)
[ 4.059997] random: jshn: uninitialized urandom read (4 bytes read)
Failed to set attribute: Unspecific failure
[ 4.945753] bgmac_bcma bcma0:5: Timeout waiting for reg 0x1E0
[ 4.952342] bcma-host-soc 18000000.axi: bus0: No bus clock specified for D145 device, pmu rev. 28, using default 80000000 Hz
[ 4.964487] bgmac_bcma bcma0:5 eth0: Link is Up - 1Gbps/Full - flow control off
[ 4.972025] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 4.979032] IPv6: ADDRCONF(NETDEV_CHANGE): eth0.1: link becomes ready
[ 4.986431] random: procd: uninitialized urandom read (4 bytes read)
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
[ 9.460606] mount_root: jffs2 not ready yet, using temporary tmpfs overlay
[ 9.472527] urandom-seed: Seed file not found (/etc/urandom.seed)
[ 9.555558] bgmac_bcma bcma0:5: Timeout waiting for reg 0x1E0
[ 9.606622] procd: - early -
[ 9.609693] procd: - watchdog -
[ 10.235105] procd: - watchdog -
[ 10.308532] procd: - ubus -
[ 10.337651] random: ubusd: uninitialized urandom read (4 bytes read)
[ 10.365765] random: ubusd: uninitialized urandom read (4 bytes read)
[ 10.372940] random: ubusd: uninitialized urandom read (4 bytes read)
[ 10.384806] procd: - init -
Please press Enter to activate this console.
[ 10.961248] urngd: jent-rng init failed, err: 2
[ 11.296024] kmodloader: loading kernel modules from /etc/modules.d/*
[ 11.374644] Loading modules backported from Linux version v5.10.168-0-g707c48210a53
[ 11.382573] Backport generated by backports.git v5.10.168-1-0-gd990acb2
[ 11.443827] xt_time: kernel timezone is -0000
[ 11.507593] PPP generic driver version 2.4.2
[ 11.516109] NET: Registered protocol family 24
[ 11.545870] b43-phy0: Broadcom 43217 WLAN found (core revision 30)
[ 11.555084] b43-phy0: Found PHY: Analog 9, Type 4 (N), Revision 17
[ 11.561365] b43-phy0: Found Radio: Manuf 0x17F, ID 0x2057, Revision 14, Version 1
[ 11.575381] Broadcom 43xx driver loaded [ Features: NL ]
[ 11.643112] kmodloader: done loading kernel modules from /etc/modules.d/*
[ 12.761544] random: jshn: uninitialized urandom read (4 bytes read)
[ 12.820453] random: jshn: uninitialized urandom read (4 bytes read)
[ 13.054441] random: jshn: uninitialized urandom read (4 bytes read)

Attachment: config.gz
Description: application/gzip