Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation

From: James Morse
Date: Wed Jul 17 2019 - 13:51:51 EST


Hi Pavel,

On 16/07/2019 17:56, Pavel Tatashin wrote:
> Added identity mapped page table, and keep MMU enabled while
> kernel is being relocated from sparse pages to the final
> destination during kexec.

The 'tl;dr' version of this: I strongly urge you to start with the hibernate code that
already covers all these known corner cases. x86 was not a good starting point.


After a quick skim:

This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
These regions were described by firmware as having content that was/is written with
different attributes. The attributes must match whenever the region is mapped, otherwise we
lose coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
cache whenever it likes.
It may be important that we never map some of these regions, even though they are
described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
its own use; it is made secure-only, and any access causes an
external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
the kernel as 'nomap'. The equivalent DT support uses /memreserve/ or a reserved-memory
node with the 'no-map' property.
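
(Illustration only, helper name made up: the sort of guard I'd expect before anything is
added to these tables. memblock_is_map_memory() already knows about 'nomap'; I think it's
the same check arm64's pfn_valid() makes.)

#include <linux/memblock.h>
#include <linux/mm.h>

/*
 * Sketch only: refuse to map anything firmware marked 'nomap', or
 * anything that isn't memory at all. memblock_is_map_memory() returns
 * false for both.
 */
static bool kexec_can_map_cacheable(phys_addr_t start, phys_addr_t size)
{
	phys_addr_t addr;

	for (addr = start; addr < start + size; addr += PAGE_SIZE)
		if (!memblock_is_map_memory(addr))
			return false;

	return true;
}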

Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
trusted kexec-tools to only write the kernel to memory. The host may be using a bigger page
size, causing more memory to become cacheable than was intended:
Linux's EFI support rounds the UEFI memory map to the largest supported page size (and
whinges about firmware bugs when it does).
If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
that is a bug we should fix.
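
(If a check is needed, something like this untested sketch is the shape I'd expect, built
on region_intersects(); the helper name is made up.)

#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/kexec.h>

/* Untested sketch: every segment's destination must be System RAM. */
static int kexec_segments_are_system_ram(struct kimage *image)
{
	unsigned long i;

	for (i = 0; i < image->nr_segments; i++) {
		struct kexec_segment *seg = &image->segment[i];

		if (region_intersects(seg->mem, seg->memsz,
				      IORESOURCE_SYSTEM_RAM,
				      IORES_DESC_NONE) != REGION_INTERSECTS)
			return -EADDRNOTAVAIL;
	}

	return 0;
}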

The only way to do this properly is to copy the linear mapping. The arch code has lots of
complex code to generate it correctly at boot; we do not want to duplicate that here.
(This is why hibernate copies the linear mapping.)


These patches do not remove the running page tables from TTBR1. As you overwrite the live
page tables you will corrupt the state of the CPU. The page-table walker may access things
that aren't memory, cache memory that shouldn't be cached (see above), and allocate
conflicting entries in the TLB.
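
(If you do build a copy of the linear map, cpu_replace_ttbr1() in asm/mmu_context.h is the
existing helper for pointing TTBR1 at it safely; roughly, and with a made-up wrapper name:)

#include <asm/mmu_context.h>

/*
 * Sketch only: switch TTBR1 over to the copied tables before the
 * relocation loop starts overwriting memory the live tables may sit
 * in. cpu_replace_ttbr1() bounces through the idmap so the switch
 * doesn't happen under its own feet.
 */
static void kexec_switch_to_copied_tables(pgd_t *copied_pgd)
{
	cpu_replace_ttbr1(copied_pgd);
}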

You cannot use the mm page-table helpers to build an idmap on arm64. The mm page-table
helpers have a compile-time VA_BITS, and we support systems where there is no memory below
1<<VA_BITS (crazy, huh!). Picking on AMD-Seattle again: if you boot a 4K, 39-bit VA kernel,
the idmap will need more page-table levels than the page-table helpers can build. This is
why there are special helpers to load the idmap and twiddle TCR_EL1.T0SZ.
You already need to copy the linear map, so using an idmap as well is extra work. If you
want to work with linear-map addresses, you probably need to add a field to the appropriate
structure.
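
(For reference, the special helpers I mean are cpu_install_idmap()/cpu_uninstall_idmap()
in asm/mmu_context.h. If you really did need to run something from the idmap, the pattern
would look roughly like this; the wrapper name is made up:)

#include <asm/mmu_context.h>

/*
 * Sketch only: the existing helpers load idmap_pg_dir into TTBR0 and
 * adjust TCR_EL1.T0SZ for you, then put things back afterwards.
 */
static void run_from_idmap(void (*fn)(void))
{
	cpu_install_idmap();
	fn();
	cpu_uninstall_idmap();
}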

The kexec relocation code still runs at EL2. You can't use a copy of the linear map there,
as there is only one TTBR on v8.0, and you'd need to set EL2 up again as it has been torn
back to the hyp-stub. This is the reason hibernate parks EL2 in a holding pen while it
rewrites all of memory, then calls back to fix up EL2. Keeping the rewrite phase at EL1
means it doesn't need independent tweaking/testing. You need to do something similar:
either call into EL2 to start the new image, or disable the MMU at EL1 and start the new
image from there.

You will need to alter the relocation code to do nothing for kdump, as no relocation is
required there, and building page tables is extra work during which the kernel may croak,
preventing us from reaching kdump.
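
(Something like this untested check, with a made-up name, should be all kdump needs:)

#include <linux/kexec.h>

/* Untested sketch: kdump images are already at their final address. */
static bool kexec_needs_relocation(struct kimage *kimage)
{
	return kimage->type != KEXEC_TYPE_CRASH;
}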

Finally, having this independent idmap machinery isn't desirable from a maintenance
perspective. Please start with the hibernate code: it solves a very similar problem, and it
already has most of these corner cases covered.


> This patch series works in terms, that I can kexec-reboot both in QEMU

I wouldn't expect QEMU's emulation of the MMU and caches to be performance-accurate.


> and on a physical machine. However, I do not see performance improvement
> during relocation. The performance is just as slow as before with disabled
> caches.

> Am I missing something? Perhaps, there is some flag that I should also
> enable in page table? Please provide me with any suggestions.

Some information about the physical machine you tested this on would help.
I'm guessing it's v8.0, and booted at EL2....


Thanks,

James