Re: Boot failure due to: x86/boot: Save fields explicitly, zero out everything else

From: Neil MacLeod
Date: Wed Aug 21 2019 - 14:32:41 EST


Hi John

I can test any patches given a link to the diff, and am happy to do so.

If I've understood your suggestion correctly, I will try commenting
out all of the entries, then add them back one-by-one until I get a
non-booting situation. I'll let you know how I get on.

The OS I'm testing is LibreELEC, which is a custom x86_64 build, and
this hasn't shown any problems on this Skylake i5 NUC in all the years
I've been using it as a test system (since at least 4.15.y). So far
5.3-rc has been trouble-free until 5.3-rc5. As I mentioned in the bug,
I've been able to boot 5.3-rc5 from a USB memory stick, but can't boot
5.3-rc5 from the internal M2 drive unless I revert this commit.

Regards
Neil

On Wed, 21 Aug 2019 at 19:20, John Hubbard <jhubbard@xxxxxxxxxx> wrote:
>
> On 8/21/19 10:05 AM, Neil MacLeod wrote:
> > Hi John
> >
> > The following bug might be of interest:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=204645
> >
> > Let me know if I can be of any help.
> >
>
> Hi Neil,
>
> First of all, I'm pasting in the bug information so that it's directly available
> here:
>
> ===================================================================
> Description: Neil MacLeod 2019-08-21 16:29:19 UTC
> System: Intel i5 Skylake NUC (NUC6i5SYH)
>
> This system boots fine from internal M2 (128GB) drive with 5.3-rc4.
>
> With 5.3-rc5, it does not boot from M2 and is stuck on the Intel splash screen (no other text is displayed, no panic etc.). It will boot 5.3-rc5 from a USB flash memory stick (via the F10 boot menu), but not from the internal M2.
>
> Bisecting between 5.3-rc4 and 5.3-rc5, the bad commit is:
>
> neil@nm-linux:~/projects/pullrequest_repos/torvalds-linux$ git bisect bad
> a90118c445cc7f07781de26a9684d4ec58bfcfd1 is the first bad commit
> commit a90118c445cc7f07781de26a9684d4ec58bfcfd1
> Author: John Hubbard <jhubbard@xxxxxxxxxx>
> Date: Tue Jul 30 22:46:27 2019 -0700
>
> x86/boot: Save fields explicitly, zero out everything else
>
> Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds
> memset, if the memset goes accross several fields of a struct. This
> generated a couple of warnings on x86_64 builds in sanitize_boot_params().
>
> Fix this by explicitly saving the fields in struct boot_params
> that are intended to be preserved, and zeroing all the rest.
>
> [ tglx: Tagged for stable as it breaks the warning free build there as well ]
>
> Suggested-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Suggested-by: H. Peter Anvin <hpa@xxxxxxxxx>
> Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@xxxxxxxxxx
>
> :040000 040000 e0963edca990540dd759765a3d765af4698df892 d07e645eb3a500c31bd65526205e286ff6941187 M arch
>
> Comment 1 Neil MacLeod 2019-08-21 16:31:35 UTC
> The kernel is built with gcc-9.2.0.
>
> Comment 2 Neil MacLeod 2019-08-21 16:55:48 UTC
>
> 5.3-rc5 with "x86/boot: Save fields explicitly, zero out everything else" reverted will build with gcc-9.2.0, and boot from M2.
> ===================================================================
>
> I'm also CC'ing the lists, so they know that the patch has caused a regression, and
> also out of hope that they can help me choose the shortest path forward to debugging
> this. My first reaction is that the list of BOOT_PARAM_PRESERVE() fields is
> flawed--by which I include cases of "the system improperly relied on a field that the spec said
> should be zeroed". (After all, the boot_params->sentinel is set, which already means
> the system is not really doing it right.)
>
> So I'm going to cheat and ask right now if anyone notices either ommissions
> or wrong entries here:
>
> static void sanitize_boot_params(struct boot_params *boot_params)
> {
> ...
> const struct boot_params_to_save to_save[] = {
> BOOT_PARAM_PRESERVE(screen_info),
> BOOT_PARAM_PRESERVE(apm_bios_info),
> BOOT_PARAM_PRESERVE(tboot_addr),
> BOOT_PARAM_PRESERVE(ist_info),
> BOOT_PARAM_PRESERVE(acpi_rsdp_addr),
> BOOT_PARAM_PRESERVE(hd0_info),
> BOOT_PARAM_PRESERVE(hd1_info),
> BOOT_PARAM_PRESERVE(sys_desc_table),
> BOOT_PARAM_PRESERVE(olpc_ofw_header),
> BOOT_PARAM_PRESERVE(efi_info),
> BOOT_PARAM_PRESERVE(alt_mem_k),
> BOOT_PARAM_PRESERVE(scratch),
> BOOT_PARAM_PRESERVE(e820_entries),
> BOOT_PARAM_PRESERVE(eddbuf_entries),
> BOOT_PARAM_PRESERVE(edd_mbr_sig_buf_entries),
> BOOT_PARAM_PRESERVE(edd_mbr_sig_buffer),
> BOOT_PARAM_PRESERVE(e820_table),
> BOOT_PARAM_PRESERVE(eddbuf),
> };
>
> If not, then I think we need to bisect by...well, let's start with the
> theory that we zeroed out too much, so we could start adding more fields
> to preserve. If that doesn't find the problem, then it's more complicated,
> and might be better to go the other direction: starting without my commit,
> and zeroing out fields until we see the failure.
>
> Are you able to test patches? It would take some time, since there are quite a
> few fields. Alternately, if you provide some more system information details
> (especially if we have any other notes about other failures and passes), then
> I might be able to borrow a Skylake system and attempt a repro.
>
>
> thanks,
> --
> John Hubbard
> NVIDIA