Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas

From: Gowans, James
Date: Mon Oct 07 2024 - 04:42:00 EST


On Wed, 2024-10-02 at 15:55 -0300, Jason Gunthorpe wrote:
> On Mon, Sep 16, 2024 at 01:30:54PM +0200, James Gowans wrote:
> > Now actually implementing the serialise callback for iommufd.
> > On KHO activate, iterate through all persisted domains and write their
> > metadata to the device tree format. For now just a few fields are
> > serialised to demonstrate the concept. To actually make this useful a
> > lot more field and related objects will need to be serialised too.
>
> But isn't that a rather difficult problem? The "a lot more fields"
> include things like pointers to the mm struct, the user_struct and
> task_struct, then all the pinning accounting as well.
>
> Coming work extends this to memfds and more is coming. I would expect
> this KHO stuff to use the memfd-like path to access the physical VM
> memory too.
>
> I think expecting to serialize and restore everything like this is
> probably much too complicated.

On reflection I think you're right - this will be complex both from a
development and a maintenance perspective, trying to make sure we
serialise all the necessary state and reconstruct it correctly. It gets
even more complex when structs are refactored/changed across kernel
versions.
An important requirement of this functionality is the ability to kexec
between different kernel versions including going back to an older
kernel version in the case of a rollback.

So, let's look at other options:

>
> If you could just retain a small portion and then directly reconstruct
> the missing parts it seems like it would be more maintainable.

I think we have three possible approaches here, this RFC's and two others:

1. What this RFC is sketching out, serialising fields from the structs
and setting those fields again on deserialise. As you point out this
will be complicated.

2. Get userspace to do the work: userspace needs to re-do the ioctls
after kexec to reconstruct the objects. My main issue with this approach
is that the kernel needs some sort of trust-but-verify step to
ensure that userspace constructs everything the same way after kexec as
it was before kexec. We don't want to end up in a state where the
iommufd objects don't match the persisted page tables.

3. Serialise and replay the ioctls. Ioctl APIs and payloads should
(must?) be stable across kernel versions. If IOMMUFD records the ioctls
executed by userspace then it could replay them as part of deserialise
and give userspace a handle to the resulting objects after kexec. This
way we are guaranteed consistent iommufd / IOAS objects. By "consistent"
I mean they are the same as before kexec and match the persisted page
tables. Having the kernel do this means we don't need to depend
on userspace doing the correct thing.
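
To make option 3 a bit more concrete, here is a rough userspace model of
the record-and-replay idea. It is only a sketch: every name in it
(replay_log, replay_log_record, replay_log_replay, replay_count_handler)
is made up for illustration and is not an existing iommufd symbol, and
the fixed-size log stands in for whatever would really be written to KHO.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names and limits throughout; none of this exists in iommufd. */
#define REPLAY_MAX_RECORDS 16
#define REPLAY_MAX_PAYLOAD 64

struct replay_record {
	uint32_t cmd;                        /* ioctl command number */
	uint32_t len;                        /* payload length in bytes */
	uint8_t payload[REPLAY_MAX_PAYLOAD]; /* copy of the ioctl payload */
};

struct replay_log {
	size_t count;
	struct replay_record recs[REPLAY_MAX_RECORDS];
};

typedef int (*replay_handler_t)(uint32_t cmd, const void *payload,
				uint32_t len, void *ctx);

/* Append one executed ioctl to the log. Returns 0 on success, -1 if full. */
static int replay_log_record(struct replay_log *log, uint32_t cmd,
			     const void *payload, uint32_t len)
{
	struct replay_record *rec;

	if (log->count >= REPLAY_MAX_RECORDS || len > REPLAY_MAX_PAYLOAD)
		return -1;

	rec = &log->recs[log->count++];
	rec->cmd = cmd;
	rec->len = len;
	memset(rec->payload, 0, sizeof(rec->payload));
	if (len)
		memcpy(rec->payload, payload, len);
	return 0;
}

/* Re-issue every recorded ioctl in order; stop at the first failure. */
static int replay_log_replay(const struct replay_log *log,
			     replay_handler_t handler, void *ctx)
{
	size_t i;

	for (i = 0; i < log->count; i++) {
		const struct replay_record *rec = &log->recs[i];
		int ret = handler(rec->cmd, rec->payload, rec->len, ctx);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example handler: counts replayed records and remembers the last command. */
struct replay_stats {
	size_t calls;
	uint32_t last_cmd;
};

static int replay_count_handler(uint32_t cmd, const void *payload,
				uint32_t len, void *ctx)
{
	struct replay_stats *st = ctx;

	(void)payload;
	(void)len;
	st->calls++;
	st->last_cmd = cmd;
	return 0;
}
```

In a real implementation the handler would dispatch into the existing
ioctl paths rather than count calls, and the log would have to cope with
ioctls that destroy or modify earlier objects. This sketch ignores both.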

What do you think of this 3rd approach? I can try to sketch it out and
send another RFC if you think it sounds reasonable.

>
> Ie "recover" a HWPT from a KHO on a manually created a IOAS with the
> right "memfd" for the backing storage. Then the recovery can just
> validate that things are correct and adopt the iommu_domain as the
> hwpt.

This sounds more like option 2 where we expect userspace to re-drive the
ioctls, but verify that they carry the same payloads as before kexec
so that iommufd objects are consistent with persisted page tables.
If the kernel is doing verification wouldn't it be better for the kernel
to do the ioctl work itself and give the resulting objects to userspace?
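
For comparison, the verification step in option 2 could be modelled
roughly as below. Again this is only a sketch under assumptions: struct
persisted_area and verify_rebuilt_areas are hypothetical names, the three
fields are a guess at the minimum that would need to match, and it
assumes userspace rebuilds the mappings in the same order as they were
persisted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one persisted IOAS mapping; not an iommufd type. */
struct persisted_area {
	uint64_t iova;           /* start of the IOVA range */
	uint64_t length;         /* length of the range in bytes */
	uint64_t backing_offset; /* offset into the backing memfd (assumed) */
};

static bool area_equal(const struct persisted_area *a,
		       const struct persisted_area *b)
{
	return a->iova == b->iova && a->length == b->length &&
	       a->backing_offset == b->backing_offset;
}

/*
 * Returns true only if userspace rebuilt exactly the persisted set of
 * areas, in the same order. Any mismatch means the new objects would not
 * match the persisted page tables, so the restore should be rejected.
 */
static bool verify_rebuilt_areas(const struct persisted_area *saved,
				 size_t nsaved,
				 const struct persisted_area *rebuilt,
				 size_t nrebuilt)
{
	size_t i;

	if (nsaved != nrebuilt)
		return false;

	for (i = 0; i < nsaved; i++) {
		if (!area_equal(&saved[i], &rebuilt[i]))
			return false;
	}
	return true;
}
```

Note that the kernel still has to persist the saved array for this check,
which is most of the information needed to redo the work itself.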

>
> Eventually you'll want this to work for the viommus as well, and that
> seems like a lot more tricky complexity..
>
> Jason