Re: [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data

From: Dave Young
Date: Thu Mar 21 2024 - 22:17:21 EST


Hi Jiri,

On Thu, 21 Mar 2024 at 18:32, Jiri Bohac <jbohac@xxxxxxx> wrote:
>
> Hi,
>
> On Thu, Mar 21, 2024 at 05:23:20PM +0800, Dave Young wrote:
> > crashkernel reservation failed on a Thinkpad t440s laptop recently.
> > Actually the memblock reservation succeeded, but later insert_resource()
> > failed.
> >
> > Test steps:
> > kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
> > kexec reboot ->
> > dmesg|grep "crashkernel reserved";
> > crashkernel memory range like below reserved successfully:
> > 0x00000000d0000000 - 0x00000000da000000
> > But no such "Crash kernel" region in /proc/iomem
> >
> > The background story is like below:
> >
> > Currently E820 code reserves setup_data regions for both the current
> > kernel and the kexec kernel, and it inserts them into the resources list.
> > Before the kexec kernel reboots nobody passes the old setup_data, and
> > kexec only passes fresh SETUP_EFI and SETUP_IMA if needed. Thus the old
> > setup data memory is not used at all.
> >
> > Due to old kernel updates the kexec e820 table as well so kexec kernel
> > sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
> > regions are inserted into resources list in the kexec kernel by
> > e820__reserve_resources().
> >
> > Note, due to no setup_data is passed in for those old regions they are not
> > early reserved (by function early_reserve_memory), and the crashkernel
> > memblock reservation will just treat them as usable memory and it could
> > reserve the crashkernel region which overlaps with the old setup_data
> > regions. And just like the bug I noticed here, kdump insert_resource
> > failed because e820__reserve_resources has added the overlapped chunks
> > in /proc/iomem already.
>
> wouldn't this be caused by
> 4a693ce65b186fddc1a73621bd6f941e6e3eca21 ("kdump: defer the
> insertion of crashkernel resources")?
>
> Before that the crashkernel resources were inserted from
> arch_reserve_crashkernel() which is called before
> e820__reserve_resources().

I think reverting the commit you mentioned would paper over this issue,
but it is not the root cause. With the revert, arch_reserve_crashkernel()
would succeed, but e820 would still try to reserve the old setup_data
ranges, which overlap the crashkernel region, for another purpose.
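
To make the overlap concrete, here is a toy userspace model (not kernel
code). It only captures the rule that a range which partially crosses an
existing entry cannot be placed in a resource tree; the real
insert_resource() does more (reparenting, sibling lists). All toy_*
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct resource: inclusive [start, end]. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

enum toy_overlap {
	TOY_DISJOINT,	/* no byte in common */
	TOY_NESTED,	/* one range fully contains the other */
	TOY_PARTIAL,	/* the ranges cross each other's boundary */
};

static enum toy_overlap toy_classify(struct toy_range a, struct toy_range b)
{
	assert(a.start <= a.end && b.start <= b.end);

	if (a.end < b.start || b.end < a.start)
		return TOY_DISJOINT;
	if ((a.start <= b.start && b.end <= a.end) ||
	    (b.start <= a.start && a.end <= b.end))
		return TOY_NESTED;
	return TOY_PARTIAL;
}

/*
 * Return 0 if 'cand' is disjoint from, or cleanly nested with, every
 * existing entry; return -1 as soon as one partial overlap is found.
 */
static int toy_can_insert(const struct toy_range *existing, size_t n,
			  struct toy_range cand)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (toy_classify(existing[i], cand) == TOY_PARTIAL)
			return -1;
	}
	return 0;
}
```

With the crashkernel range from the report, [0xd0000000, 0xd9ffffff],
and a hypothetical already-inserted chunk that straddles its upper
boundary, toy_can_insert() returns -1. The chunk address is an
assumption; the real layout on the t440s was not posted in this thread.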

>
> The semantics of E820_TYPE_RESERVED_KERN wrt kexec quite
> inconsistent. It's treated as E820_TYPE_RAM by
> e820__memblock_setup() and e820_type_to_iomem_type().
>
> The problem we're seeing here is the result of the former.
> e820__memblock_setup() will add the E820_TYPE_RESERVED_KERN
> region to the memblock, merging with the neighbouring memblocks,
> allowing crashkernel region to span across the originally
> reserved area.
>
> e820_type_to_iomem_type() treating E820_TYPE_RESERVED_KERN as
> E820_TYPE_RAM will make the E820_TYPE_RESERVED_KERN appear as
> system ram in /proc/iomem. If the old kexec_load (not
> kexec_file_load) syscall is used, the userspace kexec utility
> will construct the e820 table based on the contents of
> /proc/iomem and the kexec kernel will see the
> E820_TYPE_RESERVED_KERN range as E820_TYPE_RAM. When
> kexec_file_load is used the E820_TYPE_RESERVED_KERN type is
> propagated to the kexec kernel by bzImage64_load() /
> setup_e820_entries().

This is true, but it does not matter for the kexec kernel: those ranges
are only reserved for the 1st kernel and mean nothing to the kexec
kernel. Using them as system RAM in the 2nd (kexec) kernel is fine.

>
>
> > Index: linux/arch/x86/kernel/e820.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/e820.c
> > +++ linux/arch/x86/kernel/e820.c
> > @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
> > pa_next = data->next;
> >
> > e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
> > - /*
> > - * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
> > - * to be reserved.
> > - */
> > - if (data->type != SETUP_EFI && data->type != SETUP_IMA)
> > - e820__range_update_kexec(pa_data,
> > - sizeof(*data) + data->len,
> > - E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
>
> Your tree is missing this recent commit:
> 7fd817c906503b6813ea3b41f5fdf4192449a707 ("x86/e820: Don't
> reserve SETUP_RNG_SEED in e820").
>
> Wouldn't this fix [/paper over] your problem as well? I.e., isn't
> SETUP_RNG_SEED the setup_data item that's causing your problem?

Thanks for catching this, I will rebase and repost.

But it does not "fix" the problem, as my problem is related to another
setup_data range. I think it is SETUP_PCI (not 100% sure, but it is
certainly not RNG_SEED).
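
As a rough illustration of why exempting RNG_SEED alone would not help
here, the sketch below models the condition from the removed hunk in
userspace C. The numeric type values are quoted from memory of the x86
boot protocol header and should be checked against the tree. The second
helper is only a reading of the title of commit 7fd817c906, not a copy
of its code. All toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * setup_data type numbers. Assumption: values recalled from the x86
 * boot protocol header, verify before relying on them.
 */
enum toy_setup_type {
	TOY_SETUP_PCI      = 3,
	TOY_SETUP_EFI      = 4,
	TOY_SETUP_IMA      = 8,
	TOY_SETUP_RNG_SEED = 9,
};

/* Mirrors the condition in the hunk removed by the patch quoted above. */
static bool toy_marked_in_kexec_table(uint32_t type)
{
	return type != TOY_SETUP_EFI && type != TOY_SETUP_IMA;
}

/*
 * Same check with RNG_SEED exempted as well. Assumption: this is what
 * "Don't reserve SETUP_RNG_SEED in e820" amounts to for the kexec table.
 */
static bool toy_marked_with_rng_seed_exempt(uint32_t type)
{
	return toy_marked_in_kexec_table(type) && type != TOY_SETUP_RNG_SEED;
}
```

Under both variants a SETUP_PCI entry is still marked as
E820_TYPE_RESERVED_KERN in the kexec table, so the stale range would
still reach the kexec kernel.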

>
> Regards,
>
> --
> Jiri Bohac <jbohac@xxxxxxx>
> SUSE Labs, Prague, Czechia
>
>
Thanks
Dave