[regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs going read-only
From: Linux regression tracking (Thorsten Leemhuis)
Date: Tue Oct 01 2024 - 14:57:03 EST
Hi, Thorsten here, the Linux kernel's regression tracker.
Basavaraj Natikar, I noticed a report about a regression in
bugzilla.kernel.org that appears to be caused by a change of yours:
2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
[v6.9-rc1]
As many (most?) kernel developers don't keep an eye on the bug tracker,
I decided to write this mail. To quote from
https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
> I am getting bad page map errors on kernel version 6.9 or newer.
> They always appear within a few minutes of the system being on, if
> not immediately upon booting. My system is a Dell Inspiron 7405.
>
> This occurs with kernel 6.9.x, 6.10.x and 6.11. I tested a handful
> of versions from 6.2.x to 6.8.x as well as 5.15 and they don't have
> the same behavior. In addition to compiling from kernel.org, I tried
> to install some major distros (Fedora, CentOS, Debian, Mint, Ubuntu)
> to double check that it was not a mistake I was making with
> compilation. They were consistent with my kernel.org results.
>
> Kernel version from /proc/version of the earliest affected release I
> could identify: Linux version 6.9.0 (skyler@nobara-pc) (gcc (GCC)
> 14.2.1 20240912 (Red Hat 14.2.1-3), GNU ld version 2.41-37.fc40) #1
> SMP PREEMPT_DYNAMIC Sat Sep 28 11:17:40 EDT 2024
>
> Please let me know if there is any other information or testing that
> could help debug this. This is my first time making a bug report or
> even compiling the kernel from source so I may be missing something
> obvious. Thank you!
>
> Attached is a full dmesg log. Below I will paste a few other dmesg
> snippets and some environment information.
>
> dmesg sample #1:
>
> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
> [ 23.580724] rfkill: input handler enabled
> [ 25.652067] rfkill: input handler disabled
> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
> [ 34.680264] BUG: unable to handle page fault for address: 00000002ffffffe3
> [ 34.680272] #PF: supervisor read access in kernel mode
> [ 34.680274] #PF: error_code(0x0000) - not-present page
> [ 34.680275] PGD 0 P4D 0
> [ 34.680278] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 34.680280] CPU: 3 PID: 3252 Comm: Chroot Helper Not tainted 6.9.0 #1
> [ 34.680282] Hardware name: Dell Inc. Inspiron 7405 2n1/0XMJN6, BIOS 1.19.0 07/10/2024
> [ 34.680284] RIP: 0010:unlink_anon_vmas+0x97/0x1e0
> [ 34.680288] Code: 83 c0 22 49 89 47 18 e8 a7 19 02 00 48 8b 43 10 4c 8d 63 10 49 89 df 48 83 e8 10 4d 39 ec 74 48 48 89 c3 4d 8b 77 08 48 89 ef <49> 8b 2e 48 39 fd 74 12 48 85 ff 0f 85 06 01 00 00 48 8d 7d 08 e8
> [ 34.680290] RSP: 0018:ffffb41842c2f918 EFLAGS: 00010246
> [ 34.680292] RAX: 0000000080000000 RBX: ffff98528ab2cb00 RCX: 0000000000000000
> [ 34.680293] RDX: ffff98528ab2cb10 RSI: ffff98528862b008 RDI: 0000000000000000
> [ 34.680294] RBP: 0000000000000000 R08: 000000000000000f R09: 0000000000000060
> [ 34.680296] R10: 0000000000400030 R11: 0000000000000004 R12: ffff98528ab2c010
> [ 34.680297] R13: ffff98525ce97060 R14: 00000002ffffffe3 R15: ffff98528ab2c000
> [ 34.680298] FS: 0000000000000000(0000) GS:ffff98553f780000(0000) knlGS:0000000000000000
> [ 34.680300] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 34.680301] CR2: 00000002ffffffe3 CR3: 000000014c630000 CR4: 0000000000350ef0
> [ 34.680302] Call Trace:
> [...]
See the ticket for more details and the bisection result. Skyler, the
reporter (CCed), later also added:
> Occasionally I will not get the usual bad page map error, but
> instead some BTRFS errors followed by the file system going read-only.
Note, we had an earlier regression caused by this change, reported by
Chris Hixon, that might not have been resolved completely:
https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
Chris Hixon: do you still encounter errors, or was your issue
resolved/vanished somehow?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
P.S.: let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:
#regzbot introduced: 2105e8e00d
#regzbot title: HID: amd_sfh: Memory Errors / Page Faults / btrfs going read-only
#regzbot from: Skyler <skpu@xxxxx>
#regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219331
#regzbot ignore-activity