[REGRESSION] soft lockup on boot starting with kernel 6.10 / commit 5186ba33234c9a90833f7c93ce7de80e25fac6f5

From: Hugues Bruant
Date: Mon Sep 09 2024 - 02:54:17 EST


Hi,

I have discovered a 100% reliable soft lockup on boot on my laptop:
Purism Librem 14, Intel Core i7-10710U, 48GB RAM, Samsung 970 EVO Plus
SSD, coreboot firmware, GRUB bootloader, Arch Linux.

The last working release is kernel 6.9.10; every release from 6.10
onwards reliably exhibits the issue, which, based on journalctl logs,
seems to be triggered somewhere in systemd-udev:
https://gitlab.archlinux.org/-/project/42594/uploads/04583baf22189a0a8bb2f8773096e013/lockup.log

Bisect points to commit 5186ba33234c9a90833f7c93ce7de80e25fac6f5

At a glance, I see two potentially problematic changes in this diff,
both in the refactoring that moves the call to rdt_ctrl_update
inside the loop that walks over r->domains:

1. the change from on_each_cpu_mask to smp_call_function_any means
that preemption is no longer disabled around the call to
rdt_ctrl_update, which could plausibly be a problem

2. there's now a race condition on the msr_params struct: afaict
there's no write barrier, so if the call to rdt_ctrl_update is
executed on a different CPU, it could plausibly read an outdated value
of the dom field, which prior to this series of patches wasn't passed
as an explicit parameter, but derived inside rdt_ctrl_update
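
To make the two points above easier to follow, here is a rough
userspace model of the call pattern as I understand it from the diff.
It is not the kernel code: the model_* names, the struct layouts and
the helper signatures are all hypothetical stand-ins, and the "remote"
call simply runs on the calling thread. It therefore neither
reproduces the lockup nor demonstrates a race; it only marks where the
caller writes the dom field and where the callback reads it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a resctrl domain (one entry of r->domains). */
struct model_domain {
    int id;
    unsigned int cpu_mask;  /* bit n set => CPU n belongs to this domain */
    int updated;            /* how many times the update callback ran */
};

/* Hypothetical stand-in for msr_params, reduced to the dom field. */
struct model_msr_param {
    struct model_domain *dom;
};

/* Models rdt_ctrl_update(): takes the domain from the parameter
 * instead of deriving it from the CPU it happens to run on. */
static void model_ctrl_update(void *arg)
{
    struct model_msr_param *p = arg;

    assert(p->dom != NULL);
    p->dom->updated++;
}

/* Models "run func on any one CPU of the mask". In this sketch the
 * callback just runs on the calling thread; an empty mask is an error. */
static int model_call_function_any(unsigned int cpu_mask,
                                   void (*func)(void *), void *arg)
{
    if (cpu_mask == 0)
        return -1;
    func(arg);
    return 0;
}

/* Walks the domains and issues one call per domain, reusing a single
 * parameter struct. Returns the number of callbacks that ran. */
static int model_update_domains(struct model_domain *doms, size_t n)
{
    struct model_msr_param param = { .dom = NULL };
    int ran = 0;

    for (size_t i = 0; i < n; i++) {
        param.dom = &doms[i];  /* written by the caller ... */
        /* ... and read by the callback, possibly on another CPU in
         * the real code; that handoff is what point 2 is about. */
        if (model_call_function_any(doms[i].cpu_mask,
                                    model_ctrl_update, &param) == 0)
            ran++;
    }
    return ran;
}
```

Calling model_update_domains() on two domains with non-empty masks
runs the callback once per domain; a domain with an empty mask is
skipped.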

For initial report to Arch Linux bugtracker and bisect log see:
https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/74

Best
Hugues