Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]

From: Chris Hixon
Date: Thu Oct 03 2024 - 13:50:03 EST



On 10/2/2024, 6:29:59 AM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
> [CCing Richard, who apparently faces the same problem according to a
> recent comment in the bugzilla ticket mentioned earlier:
> https://bugzilla.kernel.org/show_bug.cgi?id=219331#c8
>
> CCing Mario, who might be interested in this and is a good contact when
> it comes to issues with AMD stuff like this.
>
> CCing the Btrfs list as JFYI, as all three reporters afaics see Btrfs
> misbehavior or corruptions due to this.
>
> Considered bringing Linus in, but decided to wait a bit before doing so.]

This patch from Basavaraj Natikar seems to solve the issue for me:

https://lore.kernel.org/linux-input/20241003160454.3017229-1-Basavaraj.Natikar@xxxxxxx/

Tested-by: Chris Hixon <linux-kernel-bugs@xxxxxxxxxxxxx>


My original report:

https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/

Reported-by: Chris Hixon <linux-kernel-bugs@xxxxxxxxxxxxx>


Thanks!

>
> On 01.10.24 23:40, Chris Hixon wrote:
>> On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
>
>>> Basavaraj Natikar, I noticed a report about a regression in
>>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>>
>>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>>> [v6.9-rc1]
>>>
>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>>> I decided to write this mail. To quote from
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>>
>>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>>> They always appear within a few minutes of the system being on, if
>>>> not immediately upon booting. My system is a Dell Inspiron 7405.
>> [...]
>>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>>> [ 23.580724] rfkill: input handler enabled
>>>> [ 25.652067] rfkill: input handler disabled
>>
>>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>>
>> No sensors detected - do we all have that in common?
>
> Skyler, Richard?
>
>>>> [...]
>>> See the ticket for more details and the bisection result. Skyler, the
>>> reporter (CCed), later also added:
>>>
>>>> Occasionally I will not get the usual bad page map error, but
>>>> instead some BTRFS errors followed by the file system going read-only.
>>>
>>> Note, we had an earlier regression caused by this change reported by
>>> Chris Hixon that maybe was not solved completely:
>>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>>
>> This looks like the same issue I reported.
>
> And sounds a lot like what Richard sees, who also sees disk corruption
> with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
>>> Chris Hixon: do you still encounter errors, or was your issue
>>> resolved/vanished somehow?
>>
>> I still encounter errors with every kernel/patch I've tested. I've blacklisted
>> the amd_sfh module as a workaround, but when the module is inserted, a crash
>> similar to those reported will happen soon after the (45 second?)
>> detection/initialization timeout. It seems to affect whatever part of the
>> kernel next becomes active. I've had disk corruption as well, when BTRFS is
>> affected by the memory corruption,
>
> Skyler, did you see btrfs disk corruption as well, just like Chris and
> Richard did?
>
>> so I've ended up testing on a USB stick I
>> can reformat if necessary. I haven't tested new patches/kernels in a while
>> though. I'll get back to you after I've tried the latest mainline. Also note
>> that I've tried Fedora Rawhide's debug kernel,
>
> From what I see it seems all three of you are using Fedora. Wonder if
> that is a coincidence.

Note: I don't think it's a Fedora issue. I've had the problem on multiple
distros, with any kernel >= 6.9 - anything with the "bad" commit.

>> which has a ton of debugging
>> options including KASAN, but nothing seems to point the finger at something
>> originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
>> chip) is corrupting memory somehow after some misstep in
>> initialization/de-initialization? Also if you look at my report, you'll see I
>> have no devices/sensors detected by amd_sfh - I wonder if other reporters all
>> have this in common? (noted in dmesg output above from another user)
>
> Given that Basavaraj Natikar never really addressed Chris's earlier report
> from months ago and the severity of the problem I'd wonder if we
> should revert the culprit to resolve this quickly, unless some proper
> fix comes into sight soon. Sadly from a quick look that would require
> multiple reverts afaics. :-/
>
> Ciao, Thorsten
>