Re: [RFC PATCH 1/5] misc: introduce FDBox

From: Pratyush Yadav
Date: Wed Mar 19 2025 - 09:35:48 EST


On Tue, Mar 18 2025, Jason Gunthorpe wrote:

> On Tue, Mar 18, 2025 at 11:02:31PM +0000, Pratyush Yadav wrote:
>
>> I suppose we can serialize all FDs when the box is sealed and get rid of
>> the struct file. If kexec fails, userspace can unseal the box, and FDs
>> will be deserialized into a new struct file. This way, the behaviour
>> from userspace perspective also stays the same regardless of whether
>> kexec went through or not. This also helps tie FDBox closer to KHO.
>
> I don't think we can do a proper de-serialization without going
> through kexec. The new stuff Mike is posting for preserving memory
> will not work like that.

Why not? If the next kernel can restore the file from the serialized
content, so can the current kernel. What stops this from working with
the new memory preservation scheme (which I assume is the idea you
proposed in [0])? In that scheme, kho_preserve_folio() marks a page to
be preserved across KHO. We can have a kho_restore_folio() function that
removes the reservation from the xarray and returns the folio to the
caller. The KHO machinery takes care of abstracting the detail of
whether kexec actually happened. With that in place, I don't see why we
can't deserialize without going through kexec.
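
To make the preserve/restore pairing concrete, here is a rough userspace
sketch of the idea. It is only a toy model: the toy_* names, the fixed-size
table standing in for the xarray, and the pfn key are all assumptions made
for illustration, not the actual KHO interfaces.

```c
#include <stddef.h>

/* Toy userspace model only; none of this is the real KHO API. */
#define TOY_MAX_PRESERVED 16

struct toy_folio {
	unsigned long pfn;	/* hypothetical lookup key */
	int data;		/* stand-in for the folio contents */
};

struct toy_kho {
	/* Stand-in for the xarray that tracks preserved folios. */
	struct toy_folio *slots[TOY_MAX_PRESERVED];
};

/*
 * Record a folio as preserved. Returns 0 on success, -1 if the folio
 * is already preserved or the table is full.
 */
int toy_preserve_folio(struct toy_kho *kho, struct toy_folio *folio)
{
	int free_slot = -1;

	for (int i = 0; i < TOY_MAX_PRESERVED; i++) {
		if (!kho->slots[i]) {
			if (free_slot < 0)
				free_slot = i;
		} else if (kho->slots[i]->pfn == folio->pfn) {
			return -1;
		}
	}
	if (free_slot < 0)
		return -1;

	kho->slots[free_slot] = folio;
	return 0;
}

/*
 * Drop the reservation and hand the folio back to the caller.
 * Returns NULL if no folio with that pfn was preserved. The caller
 * cannot tell from this interface whether a kexec happened in between.
 */
struct toy_folio *toy_restore_folio(struct toy_kho *kho, unsigned long pfn)
{
	for (int i = 0; i < TOY_MAX_PRESERVED; i++) {
		struct toy_folio *folio = kho->slots[i];

		if (folio && folio->pfn == pfn) {
			kho->slots[i] = NULL;
			return folio;
		}
	}
	return NULL;
}
```

In this model a restore after a failed kexec and a restore in the next
kernel look the same to the caller: both just take the entry out of the
table and return it.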

>
> I think error recovery will have to work by just restoring access to
> the FD and its driver state that was never actually destroyed.
>
>> > It sure would be nice if the freezing process could be managed
>> > generically somehow.
>> >
>> > One option for freezing would have the kernel enforce that userspace
>> > has closed and idled the FD everywhere (eg check the struct file
>> > refcount == 1). If userspace doesn't have access to the FD then it is
>> > effectively frozen.
>>
>> Yes, that is what I want to do in the next revision. FDBox itself will
>> not close the file descriptors when you put an FD in the box. It will
>> just grab a reference and let userspace close the FD. Then when the
>> box is sealed, the operation can be refused if refcount != 1.
>
> I'm not sure about this sealed idea..
>
> One of the design points here was to have different phases for the KHO
> process and we want to shift a lot of work to the earlier phases. Some
> of that work should be putting things into the fdbox, freezing them,
> and writing out the serialization as that may be quite time consuming.
>
> The same is true for the deserialize step where we don't want to bulk
> deserialize but do it in an ordered way to minimize the critical
> downtime.
>
> So I'm not sure if a 'seal' operation that goes and bulk serializes
> everything makes sense. I still haven't seen a state flow chart and a
> proposal where all the different required steps would have to land to
> get any certainty here.

The seal operation does bulk serialize/deserialize for _one_ box. You
can have multiple boxes and distribute your FDs in the boxes based on
the serialize or deserialize order you want. Userspace decides when to
seal or unseal a particular box, which gives it full control over the
order in which things happen.
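
As a rough illustration of how per-box sealing gives userspace control over
ordering, here is a small userspace model. The toy_box and toy_log types and
their functions are hypothetical names made up for this sketch; they are not
the FDBox UAPI, and "serializing" an FD is reduced to appending its number
to a log.

```c
#include <stddef.h>

/* Toy userspace model only; not the FDBox interface. */
#define TOY_BOX_MAX_FDS 8
#define TOY_LOG_MAX 32

struct toy_box {
	int fds[TOY_BOX_MAX_FDS];
	size_t nr_fds;
	int sealed;
};

/* Records the order in which FDs were serialized. */
struct toy_log {
	int fds[TOY_LOG_MAX];
	size_t len;
};

/* Put an FD into an unsealed box. Returns 0 on success, -1 otherwise. */
int toy_box_put(struct toy_box *box, int fd)
{
	if (box->sealed || box->nr_fds == TOY_BOX_MAX_FDS)
		return -1;

	box->fds[box->nr_fds++] = fd;
	return 0;
}

/*
 * Seal one box: serialize every FD in it, in insertion order, and
 * refuse further changes. Other boxes are not touched.
 */
int toy_box_seal(struct toy_box *box, struct toy_log *log)
{
	if (box->sealed || log->len + box->nr_fds > TOY_LOG_MAX)
		return -1;

	for (size_t i = 0; i < box->nr_fds; i++)
		log->fds[log->len++] = box->fds[i];

	box->sealed = 1;
	return 0;
}
```

If userspace puts FDs 3 and 4 in one box and FD 5 in another, then seals
the second box first, the log ends up as 5, 3, 4. The serialization order
follows the order of the seal calls, not the order in which the FDs were
added.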

>
> At least in my head I imagined you'd open the KHO FD, put it in
> serializing mode and then go through in the right order pushing all
> the work and building the serialization data structure as you go.

If we serialize the box at seal time, this is exactly how things will be
done. Before KHO activate happens, userspace can start putting in FDs
and start serializing things. Then when activation happens, the
box-level metadata gets quickly written out to the main FDT and that's
it. The bulk of the per-fd work should already be done.

We can even have something like FDBOX_PREPARE_FD or FDBOX_PREPARE_BOX
that pre-serializes as much as it can before anything is actually
frozen, so the actual freeze is faster. This is similar to pre-copy
during live migration for example.
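
The pre-copy analogy can be sketched with a toy model like the one below.
Everything in it is an assumption for illustration: the toy_fd_state
structure, the byte counters, and the function names do not correspond to
any existing or proposed FDBox operation. The point is only that the work
left for freeze time is whatever changed after the prepare step.

```c
#include <stddef.h>

/* Toy userspace model only; not a proposed FDBox interface. */
struct toy_fd_state {
	size_t total;		/* bytes of state that must be serialized */
	size_t serialized;	/* bytes already written out and still valid */
	int frozen;		/* set once the FD can no longer change */
};

/*
 * Pre-serialize everything that is currently outstanding while the FD
 * is still live. Returns the number of bytes written in this pass.
 */
size_t toy_prepare(struct toy_fd_state *fd)
{
	size_t written;

	if (fd->frozen)
		return 0;

	written = fd->total - fd->serialized;
	fd->serialized = fd->total;
	return written;
}

/*
 * The FD changed after a prepare pass, invalidating some of the
 * pre-serialized state. Returns -1 if the FD is already frozen.
 */
int toy_dirty(struct toy_fd_state *fd, size_t bytes)
{
	if (fd->frozen)
		return -1;

	fd->serialized = bytes > fd->serialized ? 0 : fd->serialized - bytes;
	return 0;
}

/*
 * Freeze the FD and write out only what is still outstanding.
 * Returns the number of bytes written during the freeze.
 */
size_t toy_freeze(struct toy_fd_state *fd)
{
	size_t written = fd->total - fd->serialized;

	fd->serialized = fd->total;
	fd->frozen = 1;
	return written;
}
```

With 100 bytes of state, a prepare pass writes all 100 up front. If 10
bytes are dirtied afterwards, the freeze only has to write those 10, which
is what keeps the frozen window short.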

All of this is made easier if each component has its own FDT (or any
other data structure) and doesn't have to share the same FDT. This is
the direction we are going in anyway with the next KHO versions.

>
> At the very end you'd finalize the KHO serialization, which just
> writes out a little bit more to the FDT and gives you back the FDT
> blob for the kexec. It should be a very fast operation.
>
> Jason
>

[0] https://lore.kernel.org/lkml/20250212152336.GA3848889@xxxxxxxxxx/

--
Regards,
Pratyush Yadav