Re: Bug: After a 'warm' reboot the disk is missing (not detected by the bios) on a HP t640

From: Sean Christopherson
Date: Wed Dec 20 2023 - 11:48:03 EST


+LKML and other maintainers

On Tue, Dec 19, 2023, Ben Mesman | Spark Narrowcasting wrote:
> Hello,
>
> First of all, sorry if this is not the proper way of submitting a patch.
> According to the get_maintainer.pl-script for arch/x86/kernel/reboot.c you
> are one of the maintainers of that file. If the patch should go to a
> different maintainer, can you direct me to either the right person, or the
> right location to find such a person?

Please don't send private mails. Kudos for using get_maintainer.pl, but a demerit
for not Cc'ing the mailing lists :-)

https://people.kernel.org/tglx/notes-about-netiquette

> I recently started upgrading some of my remote managed thin-clients from a
> 5.15.x kernel to a 6.1.x kernel. When rebooting with the new(er) kernel, the
> HP t640 clients failed. The problem is that after the warm reboot, the BIOS
> is unable to locate the internal storage (so it can't boot a valid OS).
>
> With some digging around I found that adding "reboot=p" will solve the
> problem, but because the systems are remote managed, I am unable to add this
> boot-parameter in any straightforward way.
>
> I reported the issue as a bug in Debian
> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1056056) but to get it
> solved permanently, they told me to try and get a patch/solution upstream.
>
> What worked for me is the attached patch, which adds a reboot-quirk for the
> affected systems. I can submit this also as a pull-request (I think) but I
> don't know the proper way of submitting a pull-request, the last one I did
> was more than 10 years ago.
>
> Kind regards,
> Ben Mesman
> ben@xxxxxxxxxxxxxxxxxxxxx
>
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 830425e6d38e..d63cc44a1117 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -468,6 +468,14 @@ static const struct dmi_system_id reboot_dmi_table[] __initconst = {
> DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq"),
> },
> },
> + { /* Handle problems with rebooting HP t640 thin-clients */
> + .callback = set_pci_reboot,
> + .ident = "HP t640",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "HP"),
> + DMI_MATCH(DMI_PRODUCT_NAME, "HP t640 Thin Client"),
> + },
> + },

I'm not familiar with this code (I'm not actually a maintainer/reviewer for this
code, by default get_maintainer.pl Cc's people that have recently modified the
file in question), but this looks like a hack to workaround a bug elsewhere.

All of these quirks are obviously workarounds for some kind of bug, but AFAICT
the quirks are to workaround hardware or firmware bugs, not kernel bugs. Since
5.15.x kernels worked, odds are good a bug was introduced between 5.15 and 6.1,
i.e. that this is fudging around a kernel bug that can and should be fixed.

Are you able to bisect the kernel between 6.1 and 5.15 to try and pinpoint an
exact commit that introduced the problem?