Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation

From: Jason Gunthorpe
Date: Fri Apr 04 2025 - 10:30:54 EST


On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
> > Maybe change the reserved regions code to put the region list in a
> > folio and preserve the folio instead of using FDT as a "demo" for the
> > functionality.
>
> Folios are not available when we restore reserved regions, this just won't
> work.

You don't need the folio at that point, you just need the data in the
page.

The folio would be freed after starting up the buddy allocator.

> > We know what the future use case is for the folio preservation, all
> > the drivers and the iommu are going to rely on this.
>
> We don't know how much of the preservation will be based on folios.

I think almost all of it. Where else does memory come from for drivers?

> Most drivers do not use folios

Yes they do, either through kmalloc or through alloc_page/etc. "folio"
here is just some generic word meaning memory from the buddy allocator.

The big question on my mind is if we need a way to preserve slab
objects as well..

> and for preserving memfd* and hugetlb we'd need to have some dance
> around that memory anyway.

memfd is all folios - what do you mean?

hugetlb is moving toward folios.. e.g. guest_memfd is supposed to be
taking the hugetlb special stuff and turning it into folios.

> So I think kho_preserve_folio() would be a part of the fdbox or
> whatever that functionality will be called.

It is part of KHO. Preserving the folios has to be sequenced with
starting the buddy allocator, and that is KHO's entire responsibility.

I could see something like preserving slab being in a different layer,
built on preserving folios.

> Are they?
> The purpose of basic KHO is to make sure the memory we want to preserve is
> not trampled over. Preserving folios with their orders means we need to
> make sure memory range of the folio is preserved and we carry additional
> information to actually recreate the folio object, in case it is needed and
> in case it is possible. Hugetlb, for instance, has its own way of initializing
> folios and just keeping the order won't be enough for that.

I expect many things will need a side-car datastructure to record that
additional meta-data. hugetlb can start with folios, then switch them
over to its non-folio stuff based on its metadata.

The point is the basic low level KHO mechanism is simple folios -
memory from the buddy allocator with a neutral struct folio that the
caller can then customize to its own memory descriptor type on restore.

Eventually restore would allocate a caller specific memdesc and it
wouldn't be "folios" at all. We just don't have the right words yet to
describe this.
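
As a rough illustration of that layering, here is a userspace C model,
not kernel code. Every name in it (struct preserved_rec, struct sidecar,
demo_preserve, demo_restore) is made up for the example and is not the
KHO API. It only shows the split: the low level records a neutral
(pfn, order) pair, the owner keeps its own side-car metadata, and on
restore the owner decides what the memory becomes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what the low-level layer records per preserved folio. */
struct preserved_rec {
	unsigned long pfn;	/* first page frame number */
	unsigned int order;	/* folio spans 1 << order pages */
};

/* Hypothetical side-car metadata, owned by the caller, not the low level. */
enum owner_type { OWNER_PLAIN, OWNER_HUGETLB };

struct sidecar {
	enum owner_type type;
};

#define DEMO_MAX_RECS 16

static struct preserved_rec recs[DEMO_MAX_RECS];
static struct sidecar meta[DEMO_MAX_RECS];
static size_t nr_recs;

/* Record one folio and the owner's metadata. Returns 0, or -1 if full. */
static int demo_preserve(unsigned long pfn, unsigned int order,
			 enum owner_type type)
{
	assert(order < 8 * sizeof(unsigned long));
	if (nr_recs == DEMO_MAX_RECS)
		return -1;
	recs[nr_recs].pfn = pfn;
	recs[nr_recs].order = order;
	meta[nr_recs].type = type;
	nr_recs++;
	return 0;
}

/* Total number of pages covered by all recorded folios. */
static unsigned long demo_total_pages(void)
{
	unsigned long total = 0;

	for (size_t i = 0; i < nr_recs; i++)
		total += 1UL << recs[i].order;
	return total;
}

/*
 * "Restore": hand back the neutral (pfn, order) record plus the owner's
 * side-car type, so the owner can convert the memory to its own
 * descriptor. Returns 0, or -1 for an unknown index.
 */
static int demo_restore(size_t idx, struct preserved_rec *rec,
			enum owner_type *type)
{
	if (idx >= nr_recs)
		return -1;
	*rec = recs[idx];
	*type = meta[idx].type;
	return 0;
}
```

In this model a hugetlb-like owner would call demo_preserve() with
OWNER_HUGETLB, and after demo_restore() use the returned type to switch
the plain folio over to its own representation.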

> As for the optimizations of memblock reserve path, currently it is what hurts
> the most in my and Pratyush's experiments. They are not very representative,
> but still, preserving lots of pages/folios spread all over would have its
> toll on the mm initialization.

> And I don't think invasive changes to how
> buddy and memory map initialization work are the best way to move forward and
> optimize that.

I'm pretty sure this is going to be the best performance path, but I
have no idea how invasive it would be to the buddy allocator to make
it work.

> Quite possibly we'd want to be able to minimize amount of *ranges*
> that we preserve.

I'm not sure, that seems backwards to me, we really don't want to have
KHO mem zones! So I think optimizing for, and thinking about ranges
doesn't make sense.

The big ranges will arise naturally because things like hugetlb
reservations should all be contiguous and the resulting folios should
all be allocated for the VM and also all be contiguous. So vast, vast
amounts of memory will be high order and contiguous.
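
A small userspace C sketch of why this falls out naturally. The names
(struct demo_folio, demo_count_ranges) are hypothetical and this is not
kernel code. Given per-folio (pfn, order) records sorted by pfn, it
counts how many maximal contiguous runs they form:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-folio record: start pfn and order. */
struct demo_folio {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Count maximal contiguous runs in an array sorted by pfn.
 * Two folios are contiguous when the next one starts exactly where
 * the previous one ends.
 */
static size_t demo_count_ranges(const struct demo_folio *f, size_t n)
{
	size_t ranges = 0;
	unsigned long next = 0;

	for (size_t i = 0; i < n; i++) {
		assert(f[i].order < 8 * sizeof(unsigned long));
		if (i == 0 || f[i].pfn != next)
			ranges++;
		next = f[i].pfn + (1UL << f[i].order);
	}
	return ranges;
}
```

Three back-to-back order-9 folios (2M each, assuming 4K pages) starting
at pfn 0x1000 count as a single range here, while three scattered
order-0 pages count as three. No range-oriented bookkeeping is needed
up front to get the first result.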

> Preserving folio orders with it is really straightforward and until we see
> some real data of how the entire KHO machinery is used, I'd prefer simple
> over anything else.

The maple tree may not even work, as it has a very high bound on memory
usage if the preservation workload is small and fragmented. This is
why I didn't want to use a list of ranges in the first place.

It also doesn't work so well if you need to preserve the order too :\

Until we know the workload(s) and have costed how much memory the maple
tree version will use, I don't think it is a good general starting point.

Jason