Re: [RFC PATCH] Introduce persistent memory pool

From: Gowans, James
Date: Fri Aug 25 2023 - 09:34:07 EST


On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:

Thanks for adding me to this thread, Greg!

> On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> > This patch addresses the need for a memory allocator dedicated to
> > persistent memory within the kernel. This allocator will preserve
> > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > more across kexec.
> > The proposed solution offers a foundational implementation for potential
> > custom solutions that might follow. Though the implementation is
> > intentionally kept concise and straightforward to foster discussion and
> > feedback, it's fully functional in its current state.

Hi Stanislav, it looks like we're working on similar things. I'm looking
to develop a mechanism to support hypervisor live update for when KVM is
running VMs with PCI device passthrough. Device passthrough also
necessitates passing and re-hydrating IOMMU state so that DMA can
continue during live update.

I'm planning on having an LPC session on this topic:
https://lpc.events/event/17/abstracts/1629/ (currently it's only a
submitted abstract so I'm not sure if it's visible; hopefully it will be
soon).

We are looking at implementing persistence across kexec via an in-memory
filesystem on top of reserved memory. This would have files for anything
that needs to be persisted. That includes files for IOMMU pgtables as
well as for guest memory or userspace-accessible memory.

It may be nice to solve all kexec persistence requirements with one
solution, but we can consider IOMMU separately. There are at least three
ways that this can be done:
a) Carving out reserved memory for pgtables. This is done by your
   proposal here, as well as my suggestion of a filesystem.
b) Pre/post kexec hooks for drivers to serialise state and pass it
   across in a structured format from old to new kernel.
c) Reconstructing IOMMU state in the new kernel by starting at the
   hardware registers and walking the page tables. No state passing
   needed.
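
To make option (b) a bit more concrete, here is a rough userspace
sketch of what "a structured format" could look like: a flat buffer of
header-plus-payload records that the old kernel fills and the new kernel
parses. This is not kernel code and not taken from any existing patch;
kstate_hdr, kstate_put, kstate_get and the magic value are all names I
made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record format; none of these names exist in the kernel. */
#define KSTATE_MAGIC 0x4b535441u

struct kstate_hdr {
	uint32_t magic;   /* sanity check that a record starts here */
	uint32_t version; /* lets the new kernel reject unknown layouts */
	uint32_t type;    /* which driver/subsystem owns the payload */
	uint32_t len;     /* payload length in bytes */
};

/* "Old kernel" side: write one record at buf. Returns bytes written, or
 * 0 if the record does not fit in cap bytes. */
size_t kstate_put(uint8_t *buf, size_t cap, uint32_t type,
		  const void *payload, uint32_t len)
{
	struct kstate_hdr h = { KSTATE_MAGIC, 1, type, len };

	if (cap < sizeof(h) || cap - sizeof(h) < len)
		return 0;
	memcpy(buf, &h, sizeof(h));
	memcpy(buf + sizeof(h), payload, len);
	return sizeof(h) + len;
}

/* "New kernel" side: find the first record of the given type. Returns a
 * pointer to its payload and stores its length in *len, or returns NULL
 * if no valid record of that type is found. */
const void *kstate_get(const uint8_t *buf, size_t size, uint32_t type,
		       uint32_t *len)
{
	size_t off = 0;

	while (size - off >= sizeof(struct kstate_hdr)) {
		struct kstate_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.magic != KSTATE_MAGIC || h.version != 1)
			return NULL; /* end of records or unknown layout */
		if (size - off - sizeof(h) < h.len)
			return NULL; /* truncated record */
		if (h.type == type) {
			*len = h.len;
			return buf + off + sizeof(h);
		}
		off += sizeof(h) + h.len;
	}
	return NULL;
}
```

The version field is there because the two kernels may not be the same
build, so the new kernel needs a way to refuse a layout it does not
understand rather than misparse it.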

Have you considered options (b) and (c) here? One of the implications of
(b) and (c) is that they would need to hook into the buddy allocator
really early to be able to carve out the reconstructed page tables
before the allocator is used, similar to how pkram [0] hooks in early to
carve out pages used for its filesystem.
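
As a toy illustration of the walk in (c) and the carve-out it implies,
the userspace sketch below starts from a root table and marks every
page-table frame it reaches as reserved, which is the set that would
have to be kept away from the allocator. The entry encoding, table size,
struct toy_mem and walk_tables are all assumptions of mine for the
example; real IOMMU formats and the real early-boot hooks differ.

```c
#include <stdint.h>
#include <stddef.h>

#define ENTRIES     8    /* toy table size; real tables are larger */
#define PTE_PRESENT 0x1u /* toy encoding: entry = (pfn << 1) | present */

/* Toy "physical memory": frame N holds one table of ENTRIES entries. */
struct toy_mem {
	uint64_t (*frames)[ENTRIES];
	size_t nframes;
};

/* Mark the table at pfn, and every lower-level table reachable from it,
 * in reserved[]. level is the number of table levels below this one, so
 * level 0 is a leaf table whose entries map data pages and are not
 * followed. Returns the number of table frames newly marked. */
size_t walk_tables(const struct toy_mem *m, uint64_t pfn, int level,
		   uint8_t *reserved)
{
	size_t count;

	if (pfn >= m->nframes || reserved[pfn])
		return 0; /* out of range, or already visited */
	reserved[pfn] = 1;
	count = 1;
	if (level == 0)
		return count;

	for (size_t i = 0; i < ENTRIES; i++) {
		uint64_t e = m->frames[pfn][i];

		if (e & PTE_PRESENT)
			count += walk_tables(m, e >> 1, level - 1, reserved);
	}
	return count;
}
```

In this sketch the caller would pass the root frame number (which in the
real case would come from a hardware register) and then hand reserved[]
to whatever early reservation mechanism is used, before the first
allocation happens.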

>
> >
> > Potential applications include:
> >
> > 1. Allowing various in-kernel entities to allocate persistent pages from
> > a singular memory pool, eliminating the need for multiple region
> > reservations.
> >
> > 2. For in-kernel components that require the allocation address to be
> > available on kernel kexec, this address can be exposed to user space and
> > then passed via the command line.

Do you have specific examples of other state that needs to be passed
across? I'm trying to see whether tailoring specifically to the IOMMU
case is okay. Conceptually, IOMMU state can be reconstructed starting
with hardware registers, without needing reserved memory. Other
use-cases may not have this option.

>
> As you have no in-kernel users of this, it's not something we can even
> consider at the moment for obvious reasons (neither would you want us
> to.)
>
> Can you make this part of a patch series that actually adds a user,
> probably more than one, so that we can see if any of this even makes
> sense?

I'm very keen to see this as well. The way that the IOMMU drivers are
enlightened to hook into your memory pool will likely be similar to how
they would hook into my proposal of an in-memory filesystem.
Do you have code available showing the IOMMU integration?

>
> > drivers/misc/Kconfig | 7 +
> > drivers/misc/Makefile | 1
> > drivers/misc/pmpool.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/pmpool.h | 20 ++++
> > 4 files changed, 298 insertions(+)
> > create mode 100644 drivers/misc/pmpool.c
> > create mode 100644 include/linux/pmpool.h
>
> misc is not for memory pools, as this is not a driver. please put this
> in the properly location instead of trying to hide it from the mm
> maintainers and subsystem :)

This is one of the reasons I thought a proper filesystem would be a
better way of exposing this functionality.

JG


[0]
https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@xxxxxxxxxx/T/