Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]
From: Linux regression tracking (Thorsten Leemhuis)
Date: Wed Oct 02 2024 - 08:30:21 EST
[CCing Richard, who apparently faces the same problem according to a
recent comment in the bugzilla ticket mentioned earlier:
https://bugzilla.kernel.org/show_bug.cgi?id=219331#c8
CCing Mario, who might be interested in this and is a good contact when
it comes to issues with AMD stuff like this.
CCing the Btrfs list as JFYI, as all three reporters afaics see Btrfs
misbehavior or corruptions due to this.
Considered bringing Linus in, but decided to wait a bit before doing so.]
On 01.10.24 23:40, Chris Hixon wrote:
> On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
>> Basavaraj Natikar, I noticed a report about a regression in
>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>
>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>> [v6.9-rc1]
>>
>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> I decided to write this mail. To quote from
>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>
>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>> They always appear within a few minutes of the system being on, if
>>> not immediately upon booting. My system is a Dell Inspiron 7405.
> [...]
>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>> [ 23.580724] rfkill: input handler enabled
>>> [ 25.652067] rfkill: input handler disabled
>
>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>
> No sensors detected - do we all have that in common?
Skyler, Richard?
>>> [...]
>> See the ticket for more details and the bisection result. Skyler, the
>> reporter (CCed), later also added:
>>
>>> Occasionally I will not get the usual bad page map error, but
>>> instead some BTRFS errors followed by the file system going read-only.
>>
>> Note, we had an earlier regression caused by this change reported by
>> Chris Hixon that maybe was not solved completely:
>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>
> This looks like the same issue I reported.
And sounds a lot like what Richard sees, who also sees disk corruption
with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>> Chris Hixon: do you still encounter errors, or was your issue
>> resolved/vanished somehow?
>
> I still encounter errors with every kernel/patch I've tested. I've blacklisted
> the amd_sfh module as a workaround, but when the module is inserted, a crash
> similar to those reported will happen soon after the (45 second?)
> detection/initialization timeout. It seems to affect whatever part of the
> kernel next becomes active. I've had disk corruption as well, when BTRFS is
> affected by the memory corruption,
Skyler, did you see btrfs disk corruption as well, just like Chris and
Richard did?
> so I've ended up testing on a USB stick I
> can reformat if necessary. I haven't tested new patches/kernels in a while
> though. I'll get back to you after I've tried the latest mainline. Also note
> that I've tried Fedora Rawhide's debug kernel,
From what I see it seems all three of you are using Fedora. Wonder if
that is a coincidence.
> which has a ton of debugging
> options including KASAN, but nothing seems to point the finger at something
> originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
> chip) is corrupting memory somehow after some misstep in
> initialization/de-initialization? Also if you look at my report, you'll see I
> have no devices/sensors detected by amd_sfh - I wonder if other reporters all
> have this in common? (noted in dmesg output above from another user)
Given that Basavaraj Natikar never really addressed Chris's earlier report
from months ago and the severity of the problem, I wonder if we
should revert the culprit to resolve this quickly, unless some proper
fix comes into sight soon. Sadly from a quick look that would require
multiple reverts afaics. :-/
Ciao, Thorsten