On Wed, Oct 2, 2024 at 7:30 AM Linux regression tracking (Thorsten Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
As in all systems there is an issue: there is no sensor supported.
>> Basavaraj Natikar, I noticed a report about a regression in
>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>
>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>> [v6.9-rc1]
>>
>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> I decided to write this mail. To quote from
>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>
>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>> They always appear within a few minutes of the system being on, if
>>> not immediately upon booting. My system is a Dell Inspiron 7405.
> [...]
>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>> [ 23.580724] rfkill: input handler enabled
>>> [ 25.652067] rfkill: input handler disabled
>
>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>
> No sensors detected - do we all have that in common?
My last log was from 6.11.0-debug[1], where I found this:

[ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
[ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95
[ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI

Interestingly, the first Oops came right after amd_sfh tried to load (if I'm interpreting the above correctly).
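
To make the timing above easier to check across logs, here is a minimal sketch that measures the gap between the amd_sfh init failure and the first Oops. It assumes log lines start with a bracketed seconds-since-boot timestamp, as in the excerpt; the helper names (`first_match_time`, `gap_after_sfh_failure`) are made up for illustration and are not part of any kernel tooling.

```python
import re

# Matches lines such as "[ 40.178603] kernel: message" (timestamp, then the rest).
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def first_match_time(lines, needle):
    """Return the timestamp (seconds since boot) of the first line whose
    message contains `needle`, or None if no such line exists."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


def gap_after_sfh_failure(lines):
    """Seconds between the amd_sfh init failure and the first Oops,
    or None if either event is missing from the log."""
    sfh = first_match_time(lines, "amd_sfh_hid_client_init failed")
    oops = first_match_time(lines, "Oops:")
    if sfh is None or oops is None:
        return None
    return oops - sfh


# The three lines quoted above, used as sample input.
sample = [
    "[ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0",
    "[ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95",
    "[ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI",
]

gap = gap_after_sfh_failure(sample)
print(f"first Oops came {gap:.3f} s after the amd_sfh failure")
```

A short gap only shows that the two events are close in time. It does not prove that one caused the other.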
>> See the ticket for more details and the bisection result. Skyler, the
>> reporter (CCed), later also added:
>>
>>> Occasionally I will not get the usual bad page map error, but
>>> instead some BTRFS errors followed by the file system going read-only.
>>
>> Note, we had an earlier regression caused by this change reported by
>> Chris Hixon that maybe was not solved completely:
>>
>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@xxxxxxxxxxxxx/
>
> This looks like the same issue I reported.
And sounds a lot like what Richard sees, who also sees disk corruption
with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
<snip>
> I still encounter errors with every kernel/patch I've tested. I've blacklisted
> the amd_sfh module as a workaround, but when the module is inserted, a crash
> similar to those reported will happen soon after the (45 second?)
> detection/initialization timeout. It seems to affect whatever part of the
> kernel next becomes active. I've had disk corruption as well, when BTRFS is
> affected by the memory corruption,
Skyler, did you see btrfs disk corruption as well, just like Chris and
Richard did?
Yes. Most of the time the btrfs write checker catches the problem, but not always. I've had to reinstall F40 three times while debugging this issue because of uncorrectable errors. When I run the debug kernel, I think it halts the system so fast that it doesn't have time to write the corruption to disk.
From what I see it seems all three of you are using Fedora. Wonder if
that is a coincidence.
Possibly. I can't rule out that some patch we're using helps cause or expose the issue, but Fedora tends to run the newest packages (including the Linux kernel), so it can sometimes be the early warning system for other distros.
Thanks,
Richard
[1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=2049688