Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]

From: Basavaraj Natikar
Date: Wed Oct 02 2024 - 10:08:55 EST

In reply to: Linux regression tracking (Thorsten Leemhuis): "Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]"
Next in thread: Chris Hixon: "Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]"

On 10/2/2024 6:19 PM, Richard Shaw wrote:

On Wed, Oct 2, 2024 at 7:30 AM Linux regression tracking (Thorsten Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:

>> Basavaraj Natikar, I noticed a report about a regression in
>> bugzilla.kernel.org <http://bugzilla.kernel.org> that appears
>> to be caused by a change of yours:
>>
>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>> [v6.9-rc1]
>>
>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> I decided to write this mail. To quote from
>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>
>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>> They always appear within a few minutes of the system being on, if
>>> not immediately upon booting. My system is a Dell Inspiron 7405.
> [...]
>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>> [ 23.580724] rfkill: input handler enabled
>>> [ 25.652067] rfkill: input handler disabled
>
>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>
> No sensors detected - do we all have that in common?

Yes, on all the systems that show this issue, no sensor is supported.

My last log was with 6.11.0-debug[1] and found this:

[ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
[ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95
[ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI

Since I am unable to reproduce this issue, I added a debug patch to the bug ID.
Could you please try it?

Thanks,
--
Basavaraj

Interestingly the first OOPS was right after the amd_sfh tried to load (if I'm interpreting the above correctly).

>> See the ticket for more details and the bisection result. Skyler, the
>> reporter (CCed), later also added:
>>
>>> Occasionally I will not get the usual bad page map error, but
>>> instead some BTRFS errors followed by the file system going read-only.
>>
>> Note, we had an earlier regression caused by this change reported by
>> Chris Hixon that maybe was not solved completely:
>>
>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>
> This looks like the same issue I reported.

And sounds a lot like what Richard sees, who also sees disk corruption
with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).

<snip>

> I still encounter errors with every kernel/patch I've tested. I've blacklisted
> the amd_sfh module as a workaround, but when the module is inserted, a crash
> similar to those reported will happen soon after the (45 second?)
> detection/initialization timeout. It seems to affect whatever part of the
> kernel next becomes active. I've had disk corruption as well, when BTRFS is
> affected by the memory corruption,

Skyler, did you see btrfs disk corruption as well, just like Chris and
Richard did?

Yes, most of the time the btrfs write checker catches the problem but not always. I've had to reinstall F40 3 times while debugging this issue for uncorrectable errors. When I run the debug kernel I think it brings the system to a halt so fast it doesn't have time to write the corruption to disk.

From what I see it seems all three of you are using Fedora. Wonder if
that is a coincidence.

Possibly. Can't say there isn't some patch we're using that's helping cause or expose the issue but Fedora tends to run the newest packages (including the Linux kernel) so can sometimes be the early warning system for other distros.

Thanks,
Richard

[1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=2049688
