Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs going read-only

From: Chris Hixon
Date: Tue Oct 01 2024 - 17:40:55 EST


Hi,

On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> Basavaraj Natikar, I noticed a report about a regression in
> bugzilla.kernel.org that appears to be caused by a change of yours:
>
> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
> [v6.9-rc1]
>
> As many (most?) kernel developers don't keep an eye on the bug tracker,
> I decided to write this mail. To quote from
> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>
>> I am getting bad page map errors on kernel version 6.9 or newer.
>> They always appear within a few minutes of the system being on, if
>> not immediately upon booting. My system is a Dell Inspiron 7405.
>>
>> This occurs with kernel 6.9.x, 6.10.x and 6.11. I tested a handful
>> of versions from 6.2.x to 6.8.x as well as 5.15 and they don't have
>> the same behavior. In addition to compiling from kernel.org, I tried
>> to install some major distros (Fedora, CentOS, Debian, Mint, Ubuntu)
>> to double check that it was not a mistake I was making with
>> compilation. They were consistent with my kernel.org results.
>>
>> Kernel version from /proc/version of the earliest affected release I
>> could identify: Linux version 6.9.0 (skyler@nobara-pc) (gcc (GCC)
>> 14.2.1 20240912 (Red Hat 14.2.1-3), GNU ld version 2.41-37.fc40) #1
>> SMP PREEMPT_DYNAMIC Sat Sep 28 11:17:40 EDT 2024
>>
>> Please let me know if there is any other information or testing that
>> could help debug this. This is my first time making a bug report or
>> even compiling the kernel from source so I may be missing something
>> obvious. Thank you!
>>
>> Attached is a full dmesg log. Below I will paste a few other dmesg
>> snippets and some environment information.
>>
>> dmesg sample #1:
>>
>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>> [ 23.580724] rfkill: input handler enabled
>> [ 25.652067] rfkill: input handler disabled

>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95

No sensors detected - do we all have that in common?

>> [ 34.680264] BUG: unable to handle page fault for address: 00000002ffffffe3
>> [ 34.680272] #PF: supervisor read access in kernel mode
>> [ 34.680274] #PF: error_code(0x0000) - not-present page
>> [ 34.680275] PGD 0 P4D 0
>> [ 34.680278] Oops: 0000 [#1] PREEMPT SMP NOPTI
>> [ 34.680280] CPU: 3 PID: 3252 Comm: Chroot Helper Not tainted 6.9.0 #1
>> [ 34.680282] Hardware name: Dell Inc. Inspiron 7405 2n1/0XMJN6, BIOS 1.19.0 07/10/2024
>> [ 34.680284] RIP: 0010:unlink_anon_vmas+0x97/0x1e0
>> [ 34.680288] Code: 83 c0 22 49 89 47 18 e8 a7 19 02 00 48 8b 43 10 4c 8d 63 10 49 89 df 48 83 e8 10 4d 39 ec 74 48 48 89 c3 4d 8b 77 08 48 89 ef <49> 8b 2e 48 39 fd 74 12 48 85 ff 0f 85 06 01 00 00 48 8d 7d 08 e8
>> [ 34.680290] RSP: 0018:ffffb41842c2f918 EFLAGS: 00010246
>> [ 34.680292] RAX: 0000000080000000 RBX: ffff98528ab2cb00 RCX: 0000000000000000
>> [ 34.680293] RDX: ffff98528ab2cb10 RSI: ffff98528862b008 RDI: 0000000000000000
>> [ 34.680294] RBP: 0000000000000000 R08: 000000000000000f R09: 0000000000000060
>> [ 34.680296] R10: 0000000000400030 R11: 0000000000000004 R12: ffff98528ab2c010
>> [ 34.680297] R13: ffff98525ce97060 R14: 00000002ffffffe3 R15: ffff98528ab2c000
>> [ 34.680298] FS: 0000000000000000(0000) GS:ffff98553f780000(0000) knlGS:0000000000000000
>> [ 34.680300] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 34.680301] CR2: 00000002ffffffe3 CR3: 000000014c630000 CR4: 0000000000350ef0
>> [ 34.680302] Call Trace:
>> [...]
>
> See the ticket for more details and the bisection result. Skyler, the
> reporter (CCed), later also added:
>
>> Occasionally I will not get the usual bad page map error, but
>> instead some BTRFS errors followed by the file system going read-only.
>
> Note, we had an earlier regression caused by this change reported by
> Chris Hixon that maybe was not solved completely:
> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>

This looks like the same issue I reported.

> Chris Hixon: do you still encounter errors, or was your issue
> resolved/vanished somehow?

I still encounter errors with every kernel/patch I've tested. I've blacklisted
the amd_sfh module as a workaround, but when the module is inserted, a crash
similar to those reported will happen soon after the (45 second?)
detection/initialization timeout. It seems to affect whatever part of the
kernel next becomes active. I've had disk corruption as well, when BTRFS is
affected by the memory corruption, so I've ended up testing on a USB stick I
can reformat if necessary. I haven't tested new patches/kernels in a while
though. I'll get back to you after I've tried the latest mainline. Also note
that I've tried Fedora Rawhide's debug kernel, which has a ton of debugging
options including KASAN, but nothing seems to point the finger at something
originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
chip) is corrupting memory somehow after some misstep in
initialization/de-initialization? Also if you look at my report, you'll see I
have no devices/sensors detected by amd_sfh - I wonder if other reporters all
have this in common? (noted in dmesg output above from another user)
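
For anyone who wants to check their own logs for the same pattern, here is a
minimal sketch in Python. It is not part of any existing tool, and the names
`summarize` and `matches_pattern` are made up for illustration. It assumes the
log is plain dmesg text with the default bracketed timestamps, and it only
looks for the message strings quoted above ("sensors not enabled",
"amd_sfh_hid_client_init failed", and the page fault BUG line). A match only
shows that the messages appear in that order; it does not prove that amd_sfh
caused the fault.

```python
import re

# Message patterns copied from the dmesg sample quoted earlier in this mail.
# Assumption: other affected systems log the same wording.
PATTERNS = {
    "no_sensors": re.compile(
        r"pcie_mp2_amd \S+: Failed to discover, sensors not enabled"),
    "init_failed": re.compile(
        r"pcie_mp2_amd \S+: amd_sfh_hid_client_init failed err -?\d+"),
    "page_fault": re.compile(
        r"BUG: unable to handle page fault for address: [0-9a-f]+"),
}

# Default dmesg timestamp, e.g. "[   34.222362]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def summarize(dmesg_text):
    """Return the first timestamp (seconds) of each event, or None if absent."""
    found = {key: None for key in PATTERNS}
    for line in dmesg_text.splitlines():
        ts = TIMESTAMP.search(line)
        if ts is None:
            continue  # skip lines without a timestamp
        when = float(ts.group(1))
        for key, pattern in PATTERNS.items():
            if found[key] is None and pattern.search(line):
                found[key] = when
    return found


def matches_pattern(dmesg_text):
    """True if 'no sensors' is logged and a page fault follows it."""
    events = summarize(dmesg_text)
    if events["no_sensors"] is None or events["page_fault"] is None:
        return False
    return events["no_sensors"] < events["page_fault"]
```

One way to use it is to save the output of `dmesg` to a file and call
`matches_pattern(open("dmesg.txt").read())`, where `dmesg.txt` is whatever
file name you chose.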

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> P.S.: let me use this mail to also add the report to the list of tracked
> regressions to ensure it doesn't fall through the cracks:
>
> #regzbot introduced: 2105e8e00d
> #regzbot title: HID: amd_sfh: Memory Errors / Page Faults / btrfs going
> read-only
> #regzbot from: Skyler <skpu@xxxxx>
> #regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=219331
> #regzbot ignore-activity
>