Re: [PATCH 1/2][RFC] x86/boot/e820: Introduce e820_table_ori to represent the real original e820 layout
From: Chen Yu
Date: Sun Jul 02 2017 - 13:01:39 EST
On Fri, Jun 23, 2017 at 10:42:10AM +0200, Ingo Molnar wrote:
> * Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > Hi Ingo,
> > On Thu, Jun 22, 2017 at 11:40:30AM +0200, Ingo Molnar wrote:
> > >
> > > * Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > >
> > > > Currently we intend e820_table_firmware to represent the
> > > > original firmware memory layout passed to us by the bootloader.
> > > > However, that is not the case, because e820_table_firmware might
> > > > still be modified by Linux:
> > > > 1. During bootup, the EFI boot stub might allocate memory via
> > > > EFI services for the PCI device information structures, and
> > > > later e820__reserve_setup_data() reserves these dynamically
> > > > allocated structures (AKA setup_data) in e820_table_firmware
> > > > accordingly.
> > > > 2. kexec might also modify e820_table_firmware.
> > >
> > > Hm, so why does the EFI code modify e820_table_firmware - why doesn't
> > > it modify e820_table?
> > >
> > Both e820_table and e820_table_firmware are updated in
> > e820__reserve_setup_data(), which changes the regions holding the
> > PCI device information structures from E820_TYPE_RAM to
> > E820_TYPE_RESERVED_KERN.
> > > I.e. what is the point of having 3 different versions of the
> > > memory layout table?
> > My original thought was that we should simply not record the
> > modification from the EFI boot stub in e820_table_firmware, and we
> > would be done. But after checking the code, I realized that if we do
> > so, kexec might have a potential problem.
> > The e820_table_firmware was introduced mainly for kexec and
> > was used to pass the original memory layout to the second
> > kernel:
> > commit 5dfcf14d5b28174f94cbe9b4fb35d415db61c64a
> > Author: Bernhard Walle <bwalle@xxxxxxx>
> > Date: Fri Jun 27 13:12:55 2008 +0200
> > x86: use FIRMWARE_MEMMAP on x86/E820
> > Besides, the second kernel will not re-enter the EFI boot stub
> > code; it will reuse the PCI device information structures created
> > by the first kernel, which are stored in the E820_TYPE_RESERVED_KERN
> > regions. These PCI device information structures will not be
> > modified by the second kernel, as kexec will only pass the
> > E820_TYPE_RAM regions to the second kernel, so the latter can use
> > ioremap to access the PCI information.
> > So the problem is this: if we do not record the PCI information in
> > e820_table_firmware, the regions holding it will stay typed as
> > E820_TYPE_RAM. All E820_TYPE_RAM regions are passed to the second
> > kernel and might be allocated for ordinary use there, so the second
> > kernel might not get valid PCI information (it might be overwritten
> > by other users). That is why we currently try to introduce a new
> > e820_table_ori to represent the original table provided by the BIOS
> > (mainly for the hibernation memory layout MD5 check).
> So there's 3 versions we need:
> - the original 'firmware' table as-is - for MD5 check and other potential
> - some intermediate version of the table for kexec: what is the exact definition
> of that table, what changes from the real table does it _not_ want?
The effects of boot options such as 'mem=' are not wanted by kexec, because
kexec wants the second kernel to see the whole memory layout passed by
the bootloader. I think this is why e820_table_firmware was introduced.
> - the 'real' table
> all the naming should reflect that. I.e. instead of some nonsensical "_ori"
> postfix, that is really the _firmware table. If kexec needs a separate one then
> name it _kexec and copy it at the right stage.
Ok. I'm sending V2 of this patch. I tried not to break the old behavior and
split the patch into three, so that the logic should be clearer.
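
To make the three-table split discussed above concrete, here is a rough
userspace sketch. It is not kernel code and not the V2 patch. All sk_*
names, the example layout and the addresses are made up for illustration,
and the 'mem=' handling is simplified to trimming RAM entries only. It
assumes the setup_data range lies wholly inside one RAM entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the e820 entry types discussed above. */
enum sk_type { SK_RAM = 1, SK_RESERVED = 2, SK_RESERVED_KERN = 128 };

#define SK_MAX 16

struct sk_entry { uint64_t addr, size; int type; };
struct sk_table { int nr; struct sk_entry e[SK_MAX]; };

/* One copy per consumer: as-is firmware layout, kexec view, running kernel. */
struct sk_tables { struct sk_table firmware, kexec, real; };

static void sk_add(struct sk_table *t, uint64_t addr, uint64_t size, int type)
{
	assert(t->nr < SK_MAX);
	t->e[t->nr++] = (struct sk_entry){ addr, size, type };
}

/*
 * Retype [start, start + size) if it lies wholly inside one entry of
 * type 'from'. The entry is split into up to three pieces.
 * Returns 1 on success, 0 if no matching entry was found.
 */
static int sk_retype(struct sk_table *t, uint64_t start, uint64_t size,
		     int from, int to)
{
	uint64_t end = start + size;

	for (int i = 0; i < t->nr; i++) {
		struct sk_entry old = t->e[i];
		uint64_t old_end = old.addr + old.size;

		if (old.type != from || start < old.addr || end > old_end)
			continue;
		t->e[i] = (struct sk_entry){ start, size, to };
		if (old.addr < start)
			sk_add(t, old.addr, start - old.addr, from);
		if (end < old_end)
			sk_add(t, end, old_end - end, from);
		return 1;
	}
	return 0;
}

/* Simplified 'mem=' effect: drop or truncate RAM entries above 'limit'. */
static void sk_limit_ram(struct sk_table *t, uint64_t limit)
{
	int out = 0;

	for (int i = 0; i < t->nr; i++) {
		struct sk_entry e = t->e[i];

		if (e.type == SK_RAM) {
			if (e.addr >= limit)
				continue;
			if (e.addr + e.size > limit)
				e.size = limit - e.addr;
		}
		t->e[out++] = e;
	}
	t->nr = out;
}

/* Return the type covering 'addr', or 0 if no entry covers it. */
static int sk_type_at(const struct sk_table *t, uint64_t addr)
{
	for (int i = 0; i < t->nr; i++)
		if (addr >= t->e[i].addr && addr - t->e[i].addr < t->e[i].size)
			return t->e[i].type;
	return 0;
}

static void sk_build(struct sk_tables *s, uint64_t sd_addr, uint64_t sd_size,
		     uint64_t mem_limit)
{
	memset(s, 0, sizeof(*s));

	/* Made-up layout as handed over by the bootloader; never modified. */
	sk_add(&s->firmware, 0x00000000, 0x0009f000, SK_RAM);
	sk_add(&s->firmware, 0x00100000, 0x3ff00000, SK_RAM);
	sk_add(&s->firmware, 0xfed00000, 0x00100000, SK_RESERVED);

	/* kexec view: firmware layout plus the setup_data reservation. */
	s->kexec = s->firmware;
	sk_retype(&s->kexec, sd_addr, sd_size, SK_RAM, SK_RESERVED_KERN);

	/* Running kernel: kexec view plus the 'mem=' style limit. */
	s->real = s->kexec;
	sk_limit_ram(&s->real, mem_limit);
}
```

The point of the ordering in sk_build() is that each copy is taken at the
stage where its consumer wants the table frozen: the firmware copy before
any change (for the MD5 check), the kexec copy after the setup_data
reservation but before 'mem=' is applied, and the real table last.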