Re: [RFC PATCH 1/5] misc: introduce FDBox

From: Pratyush Yadav
Date: Fri Mar 07 2025 - 19:10:47 EST


Hi Christian,

Thanks for the review!

On Fri, Mar 07 2025, Christian Brauner wrote:

> On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
>> The File Descriptor Box (FDBox) is a mechanism for userspace to name
>> file descriptors and give them over to the kernel to hold. They can
>> later be retrieved by passing in the same name.
>>
>> The primary purpose of FDBox is to be used with Kexec Handover (KHO).
>> There are many kinds anonymous file descriptors in the kernel like
>> memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
>> using KHO. To be able to do that, there needs to be a mechanism to label
>> FDs that allows userspace to set the label before doing KHO and to use
>> the label to map them back after KHO. FDBox achieves that purpose by
>> exposing a miscdevice which exposes ioctls to label and transfer FDs
>> between the kernel and userspace. FDBox is not intended to work with any
>> generic file descriptor. Support for each kind of FDs must be explicitly
>> enabled.
>
> This makes no sense as a generic concept. If you want to restore shmem
> and possibly anonymous inodes files via KHO then tailor the solution to
> shmem and anon inodes but don't make this generic infrastructure. This
> has zero chances to cover generic files.
>
> As soon as you're dealing with non-kernel internal mounts that are not
> guaranteed to always be there or something that depends on superblock or
> mount specific information that can change you're already screwed. This
> will end up a giant mess. This is not supportable or maintainable.

As Jason mentioned, this it _not_ intended to be a generic concept. My
documentation also says that, but perhaps that was not clear enough. It
is supposed to work with only specific type of file descriptors that
explicitly enable support for it in the context of kexec handover. I
think it might be a good idea to have an explicit dependency on KHO so
this distinction is a bit clearer.

It is also not intended to be completely transparent to userspace, where
they magically get their FD back exactly as they put in. I think we
should limit the amount of state we want to guarantee, since it directly
contributes to our ABI exposure to later kernels. The more state we
track, the more complex and inflexible our ABI becomes. So use of this
very much needs an enlightened userspace.

As an example, with memfd, the main purpose of persistence across kexec
is its memory contents. The application needs a way to carry its memory
across, and we provide it a mechanism to do so via FDBox. So we only
guarantee that the memory contents are preserved, along with some small
metadata, instead of the whole inode which is a lot more complex. The
application would then need to be aware of it and expect such changes.

The idea is also _not_ to have all the FDs of userspace into the box.
They should only put in the ones that they specifically need, and
re-open the rest normally. For example, in a live update scenario, the
VMM can put in the guest_memfds, the iommufds, and so on, but then
re-open configuration files or VM metadata via the normal path.

>
> And struct file should have zero to do with this KHO stuff. It doesn't
> need to carry new operations and it doesn't need to waste precious space
> for any of this.

That is a fair point. The main reason I did it this way is because memfd
does not have file_operations of its own. To enable the memfd
abstraction, I wanted the FDBox callback to go into memfd, and it can
pass it down to shmem or hugetlbfs. Having the pointer in struct file
makes that easy.

I think I can find ways to make it work via file_operations, so I will
do that in the next version.

>
>>
>> While the primary purpose of FDBox is to be used with KHO, it does not
>> explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without
>> KHO, simply as a way to preserve or transfer FDs when userspace exits.
>
> This use-case is covered with systemd's fdstore and it's available to
> unprivileged userspace. Stashing arbitrary file descriptors in the
> kernel in this way isn't a good idea.

For one, it can't be arbitrary FDs, but only explicitly enabled ones.
Beyond that, while not intended, there is no way to stop userspace from
using it as a stash. Stashing FDs is a needed operation for this to
work, and there is no way to guarantee in advance that userspace will
actually use it for KHO, and not just stash it to grab back later.

I think at least having an explicit dependency on CONFIG_KEXEC_HANDOVER
can help a bit at least.

[...]
>> +
>> + ret = close_fd(put_fd.fd);
>> + if (ret) {
>> + struct fdbox_fd *del;
>> +
>> + del = fdbox_remove_fd(box, put_fd.name);
>> + /*
>> + * If we fail to remove from list, it means someone else took
>> + * the FD out. In that case, they own the refcount of the file
>> + * now.
>> + */
>> + if (del == box_fd)
>> + fput(file);
>
> This is a racy mess. Why would adding a file to an fdbox be coupled with
> closing it concpetually? The caller should close the file descriptor
> itself and not do this close_fd() here in the kernel.

Makes sense. We can make it a requirement to have all open FDs of the
file be closed by userspace before the box can be sealed.

>
>> +
>> + return ret;
>> + }
>> +
>> + return 0;
>> +}
>> +
[...]
>> +static long box_fops_unl_ioctl(struct file *filep,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> + struct fdbox *box = filep->private_data;
>> + long ret = -EINVAL;
>> +
>> + if (!capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + switch (cmd) {
>> + case FDBOX_PUT_FD:
>> + ret = fdbox_put_fd(box, arg);
>> + break;
>> + case FDBOX_UNSEAL:
>> + ret = fdbox_unseal(box);
>> + break;
>> + case FDBOX_SEAL:
>> + ret = fdbox_seal(box);
>> + break;
>> + case FDBOX_GET_FD:
>> + ret = fdbox_get_fd(box, arg);
>> + break;
>
> How does userspace know what file descriptors are in this fdbox if only
> put and get are present? Userspace just remembers the names and
> otherwise it simply leaks files that no one remembered?

For now, it is supposed to remember that, but having a FDBOX_LIST
operation should be simple enough. Will add that in the next revision.

Also, if userspace suspects it forgot some, it can always delete the box
to clean up the leftover ones.

>
>> + default:
>> + ret = -EINVAL;
>> + break;
>> + }
>> +
>> + return ret;
>> +}
>> +
[...]

--
Regards,
Pratyush Yadav