Re: [PATCH 0/2] ESRT fixes for relocatable kexec'd kernel
From: AKASHI Takahiro
Date: Thu Mar 08 2018 - 05:25:26 EST
Ard,
On Wed, Mar 07, 2018 at 07:55:34PM +0000, Ard Biesheuvel wrote:
> (+ James)
>
> Hello Akashi,
>
> On 6 March 2018 at 09:00, AKASHI Takahiro <takahiro.akashi@xxxxxxxxxx> wrote:
> > Tyler, Jeffrey,
> >
> > On Fri, Mar 02, 2018 at 08:27:11AM -0500, Tyler Baicar wrote:
> >> On 3/2/2018 12:53 AM, AKASHI Takahiro wrote:
> >> >Tyler, Jeffrey,
> >> >
> >> >[Note: This issue takes place in kexec, not kdump. So to be precise,
> >> >it is not the same phenomenon as what I addressed in [1],[2]:
> >> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557254.html
> >> > [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-January/553098.html
> >> >]
> >> >
> >> >On Thu, Mar 01, 2018 at 12:56:38PM -0500, Tyler Baicar wrote:
> >> >>Hello,
> >> >>
> >> >>On 2/28/2018 9:50 PM, AKASHI Takahiro wrote:
> >> >>>Hi,
> >> >>>
> >> >>>On Wed, Feb 28, 2018 at 08:39:42AM -0700, Jeffrey Hugo wrote:
> >> >>>>On 2/27/2018 11:19 PM, AKASHI Takahiro wrote:
> >> >>>>>Tyler,
> >> >>>>>
> >> >>>>># I missed catching your patch as its subject doesn't contain arm64.
> >> >>>>>
> >> >>>>>On Fri, Feb 23, 2018 at 12:42:31PM -0700, Tyler Baicar wrote:
> >> >>>>>>Currently on arm64 ESRT memory does not appear to be properly blocked off.
> >> >>>>>>Upon successful initialization, ESRT prints out the memory region that it
> >> >>>>>>exists in like:
> >> >>>>>>
> >> >>>>>>esrt: Reserving ESRT space from 0x000000000a4c1c18 to 0x000000000a4c1cf0.
> >> >>>>>>
> >> >>>>>>But then by dumping /proc/iomem this region appears as part of System RAM
> >> >>>>>>rather than being reserved:
> >> >>>>>>
> >> >>>>>>08f10000-0deeffff : System RAM
> >> >>>>>>
> >> >>>>>>This causes issues when trying to kexec if the kernel is relocatable. When
> >> >>>>>>kexec tries to execute, this memory can be selected to relocate the kernel to
> >> >>>>>>which then overwrites all the ESRT information. Then when the kexec'd kernel
> >> >>>>>>tries to initialize ESRT, it doesn't recognize the ESRT version number and
> >> >>>>>>just returns from efi_esrt_init().
> >> >>>>>I'm not sure what is the root cause of your problem.
> >> >>>>>Do you have good confidence that the kernel (2nd kernel image in this case?)
> >> >>>>>really overwrite ESRT region?
> >> >>>>According to my debug, yes.
> >> >>>>Using JTAG, I was able to determine that the ESRT memory region was getting
> >> >>>>overwritten by the secondary kernel in
> >> >>>>kernel/arch/arm64/kernel/relocate_kernel.S - specifically the "copy_page"
> >> >>>>line of arm64_relocate_new_kernel()
> >> >>>>
> >> >>>>>To my best knowledge, kexec is carefully designed not to do such a thing
> >> >>>>>as it allocates a temporary buffer for kernel image and copies it to the
> >> >>>>>final destination at the very end of the 1st kernel.
> >> >>>>>
> >> >>>>>My guess is that kexec, or rather kexec-tools, tries to load the kernel image
> >> >>>>>at 0x8f80000 (or 0x9080000?, not sure) in your case. It may or may not be
> >> >>>>>overlapped with ESRT.
> >> >>>>>(Try "-d" option when executing kexec command for confirmation.)
> >> >>>>With -d, I see
> >> >>>>
> >> >>>>get_memory_ranges_iomem_cb: 0000000009611000 - 000000000e5fffff : System RAM
> >> >>>>
> >> >>>>That overlaps the ESRT reservation -
> >> >>>>[ 0.000000] esrt: Reserving ESRT space from 0x000000000b708718 to
> >> >>>>0x000000000b7087f0
> >> >>>>
> >> >>>>>Are you using initrd with kexec?
> >> >>>>Yes
> >> >>>To make the things clear, can you show me, if possible, the followings:
> >> >>I have attached all of these:
> >> >Many thanks.
> >> >According to the data, ESRT was overwritten by initrd, not the kernel image.
> >> >It doesn't matter to you though :)
> >> >
> >> >The solution would be, as Ard suggested, that more information be
> >> >added to /proc/iomem.
> >> >I'm going to fix the issue as quickly as possible.
> >> Great, thank you!! Please add us to the fix and we will gladly test it out.
> >
> > I have created a workaround patch, attached below, as a kind of PoC.
> > Can you give it a go, please?
> > You need another patch for kexec-tools, too. See
> > https:/git.linaro.org/people/takahiro.akashi/kexecl-tools.git arm64/resv_mem
> >
>
> Thanks for putting this together. Some questions below.
>
> > With this patch, extra entries for firmware-reserved memory resources,
> > which means any regions that are already reserved before arm64_memblock_init(),
> > or specifically efi/acpi tables in this case, are added to /proc/iomem.
> >
> > $ cat /proc/iomem (on my qemu+edk2 execution)
> > ...
> > 40000000-5871ffff : System RAM
> > 40080000-40f1ffff : Kernel code
> > 41040000-411e9fff : Kernel data
> > 54400000-583fffff : Crash kernel
> > 58590000-585effff : reserved
> > 58700000-5871ffff : reserved
> > 58720000-58b5ffff : reserved
> > 58b60000-5be3ffff : System RAM
> > 58b61000-58b61fff : reserved
> > 59a7b118-59a7b667 : reserved
> > 5be40000-5becffff : reserved
> > 5bed0000-5bedffff : System RAM
> > 5bee0000-5bffffff : reserved
> > 5c000000-5fffffff : System RAM
> > 5ec00000-5edfffff : reserved
> > 8000000000-ffffffffff : PCI Bus 0000:00
> > 8000000000-8000003fff : 0000:00:01.0
> > 8000000000-8000003fff : virtio-pci-modern
> >
> > While all the entries are currently marked as just "reserved," we'd better
> > give them more specific names for general/extensive use.
> > (Then it will require modifying respective fw/drivers.)
> >
> > Kexec-tools will allocate spaces for kernel, initrd and dtb so that
> > they will not be overlapped with "reserved" memory.
> >
> > As I haven't run extensive tests, please let me know if you find
> > any problems.
> >
> > Thanks,
> > -Takahiro AKASHI
> >
> >>
> >> Thanks,
> >> Tyler
> >>
> >> --
> >> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
> >> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
> >> a Linux Foundation Collaborative Project.
> >>
> > ===8<===
> > From 57d93b89d16b967c913f3949601a5559ddf4aa57 Mon Sep 17 00:00:00 2001
> > From: AKASHI Takahiro <takahiro.akashi@xxxxxxxxxx>
> > Date: Fri, 2 Mar 2018 16:39:18 +0900
> > Subject: [PATCH] arm64: kexec: set asdie firmware-reserved memory regions
> >
> > Signed-off-by: AKASHI Takahiro <takahiro.akashi@xxxxxxxxxx>
> > ---
> > arch/arm64/kernel/setup.c | 24 ++++++++++++++++++++----
> > arch/arm64/mm/init.c | 21 +++++++++++++++++++++
> > 2 files changed, 41 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 30ad2f085d1f..997f07e86243 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -87,6 +87,9 @@ static struct resource mem_res[] = {
> > #define kernel_code mem_res[0]
> > #define kernel_data mem_res[1]
> >
> > +/* TODO: Firmware-reserved memory resources */
> > +extern struct memblock_type fw_mem;
> > +
>
> Why do you need this intermediate data structure? can't you iterate
> over the memblock_reserve'd regions directly?
As you know, bunch of memory regions are *reserved* by the kernel itself
between arm64_memblock_init() and request_standard_resources(), but none
of them need to be protected across two kernels (via kexec) as they are
anyhow dynamic.
Hence we should only take care of the regions that are allocated/reserved
by firmware (UEFI).
This intermediate data is used solely for keeping track of them until
we define iomem resources in request_standard_resources().
(In my next version, I will take a bit smarter approach with calling,
say, arch_export_iomem_entry() at every occurrence of memblock_reserve()
in efi code.)
In fact, what my patch does here is an "overkill."
Device tree blob (fdt) provided by the firmware need not be preserved
because kexec expects a new fdt to be provided via kexec_load system call
by users, or otherwise by kexec-tools which will by default duplicate one
from the current kernel's fdt.
> > /*
> > * The recorded values of x0 .. x3 upon kernel entry.
> > */
> > @@ -206,7 +209,20 @@ static void __init request_standard_resources(void)
> > {
> > struct memblock_region *region;
> > struct resource *res;
> > + int i;
> > +
> > + /* add firmware-reserved memory first */
> > + for (i = 1; i < fw_mem.cnt; i++) {
> > + res = alloc_bootmem_low(sizeof(*res));
> > + res->name = "reserved";
> > + res->flags = IORESOURCE_MEM;
> > + res->start = fw_mem.regions[i].base;
> > + res->end = fw_mem.regions[i].base + fw_mem.regions[i].size - 1;
> >
> > + request_resource(&iomem_resource, res);
> > + }
> > +
> > + /* add standard resources */
> > kernel_code.start = __pa_symbol(_text);
> > kernel_code.end = __pa_symbol(__init_begin - 1);
> > kernel_data.start = __pa_symbol(_sdata);
> > @@ -224,19 +240,19 @@ static void __init request_standard_resources(void)
> > res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
> > res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
> >
> > - request_resource(&iomem_resource, res);
> > + insert_resource(&iomem_resource, res);
> >
>
> I see why you need to switch from request_resource() to
> insert_resource(). Could we ever run into the situation where a
> memblock_reserved region overlaps the boundary between System RAM and
> a NOMAP region?
Theoretically, yes.
For example, there are two neighboring reserved regions, A and B.
System RAM is immediately followed by NOMAP region.
If A is located at the very end of System RAM *and* B is located
at the beginning of the NOMAP region, what will happens?
A and B are supposed to be declared as "reserved" independently,
but memblock_add() will merge them into one because they are back-to-back.
(I assume here that any region can be marked "reserved" and NOMAP
at the same time.)
To be honest with you, the notion of NOMAP is very much confusing and
annoying me when I try to think of a solution for this problem.
> I don't /think/ this is the case, but I am not sure,
In this specific case (kernel text/data), they won't cross the boundary
between System RAM and NOMAP region.
> and if that happens, this will fail. (I think it should never be
> needed to memblock_reserve() NOMAP regions but I am not 100% sure)
Yes.
BTW, I think that I found some inconsistency in memory handling in efi.
your commit cb82cce7035e ("efi/arm64: Treat regions with WT/WC set but
WB cleared as memory") makes WT/WC memory counted as memblock memory,
i.e. memblock_is_memory() now returns true. On the other hand,
acpi_os_ioremap() is set to map such kind of regions by ioremap_cache(),
which will create a mapping with WB attribute.
While it may not cause any problem, it is not what we expect to see
in terms of UEFI/ACPI specification(?).
Thanks,
-Takahiro AKASHI
> > if (kernel_code.start >= res->start &&
> > kernel_code.end <= res->end)
> > - request_resource(res, &kernel_code);
> > + insert_resource(res, &kernel_code);
> > if (kernel_data.start >= res->start &&
> > kernel_data.end <= res->end)
> > - request_resource(res, &kernel_data);
> > + insert_resource(res, &kernel_data);
> > #ifdef CONFIG_KEXEC_CORE
> > /* Userspace will find "Crash kernel" region in /proc/iomem. */
> > if (crashk_res.end && crashk_res.start >= res->start &&
> > crashk_res.end <= res->end)
> > - request_resource(res, &crashk_res);
> > + insert_resource(res, &crashk_res);
> > #endif
> > }
> > }
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 9f3c47acf8ff..b6f86a7bbfb7 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -62,6 +62,14 @@
> > s64 memstart_addr __ro_after_init = -1;
> > phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >
> > +static struct memblock_region fw_mem_regions[INIT_MEMBLOCK_REGIONS];
> > +struct memblock_type fw_mem = {
> > + .regions = fw_mem_regions,
> > + .cnt = 1, /* empty dummy entry */
> > + .max = INIT_MEMBLOCK_REGIONS,
> > + .name = "firmware-reserved memory",
> > +};
> > +
> > #ifdef CONFIG_BLK_DEV_INITRD
> > static int __init early_initrd(char *p)
> > {
> > @@ -362,6 +370,19 @@ static void __init fdt_enforce_memory_region(void)
> > void __init arm64_memblock_init(void)
> > {
> > const s64 linear_region_size = -(s64)PAGE_OFFSET;
> > + struct memblock_region *region;
> > +
> > + /*
> > + * Export firmware-reserved memory regions
> > + * TODO: via more generic interface
> > + */
> > + for_each_memblock(reserved, region) {
> > + if (WARN_ON(fw_mem.cnt >= fw_mem.max))
> > + break;
> > + fw_mem.regions[fw_mem.cnt].base = region->base;
> > + fw_mem.regions[fw_mem.cnt].size = region->size;
> > + fw_mem.cnt++;
> > + }
> >
> > /* Handle linux,usable-memory-range property */
> > fdt_enforce_memory_region();
> > --
> > 2.16.2
> >