[PATCH V2 0/9] Use copy_process/create_io_thread in vhost layer
From: Mike Christie
Date: Tue Sep 21 2021 - 17:53:06 EST
The following patches were made over Linus's tree but also apply over
Jens's 5.16 io_uring branch and Michael's vhost/next branch.
This is version 2 of the patchset and should handle all the review
comments posted in V1 here:
https://lore.kernel.org/all/20210916212051.6918-1-michael.christie@xxxxxxxxxx/
If I missed a comment, please let me know.
This patchset allows the vhost layer to do a copy_process on the thread
that does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process
against its userspace app (Jens, the patches make create_io_thread more
generic so that's why you are cc'd). This allows the vhost layer's worker
threads to inherit cgroups, namespaces, address space, etc., and each
worker thread will also be accounted for against that owner/parent
process's RLIMIT_NPROC limit.
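As a rough userspace analogy only (this is not code from the patches), the
sketch below shows the kind of inheritance a copy_process-based child gets
for free: a fork()ed child sees the same RLIMIT_NPROC as its parent without
any extra plumbing. The function name child_inherits_nproc_limit() is made
up for this illustration.

```c
#define _DEFAULT_SOURCE
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Userspace analogy, not kernel code: fork a child and check that it
 * sees the same RLIMIT_NPROC as the parent.
 * Returns 1 if the limits match, 0 if they differ, -1 on error.
 */
int child_inherits_nproc_limit(void)
{
	struct rlimit parent_lim, child_lim;
	int fds[2];
	pid_t pid;
	ssize_t n;

	if (getrlimit(RLIMIT_NPROC, &parent_lim) < 0)
		return -1;
	if (pipe(fds) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: report the limit it inherited, then exit. */
		close(fds[0]);
		if (getrlimit(RLIMIT_NPROC, &child_lim) < 0)
			_exit(1);
		if (write(fds[1], &child_lim, sizeof(child_lim)) !=
		    (ssize_t)sizeof(child_lim))
			_exit(1);
		_exit(0);
	}

	/* Parent: read the child's view of the limit and compare. */
	close(fds[1]);
	n = read(fds[0], &child_lim, sizeof(child_lim));
	close(fds[0]);
	waitpid(pid, NULL, 0);
	if (n != (ssize_t)sizeof(child_lim))
		return -1;

	return parent_lim.rlim_cur == child_lim.rlim_cur &&
	       parent_lim.rlim_max == child_lim.rlim_max;
}
```

A kthread, by contrast, is a child of kthreadd, so none of the owner's
limits or cgroups come along unless they are attached by hand.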
If you are not familiar with qemu and vhost, here is a more detailed
problem description:
Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.
The problem with this approach is that we then have to add new functions/
args/functionality for each thing we want to inherit. I started doing
that here:
https://lkml.org/lkml/2021/6/23/1233
for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, because I'd need to do something
like that patch several times. For example, the current approach does not
support cgroups v2 so commands like virsh emulatorpin do not work. The
qemu process can go over its RLIMIT_NPROC. And for future vhost interfaces
where we export the vhost thread pid we will want the namespace info.
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option,
make it flag based, similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
vhost conversion patch.
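To illustrate the flag-based item in the V2 list above, here is a minimal
standalone sketch of the general idea: one bitmask field tested with a
helper, rather than one struct field per option. The struct, the
WORKER_OPT_* names and their meanings are hypothetical and are not the
names used in the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical option flags, for illustration only. A new option costs
 * one more bit here instead of a new field in the args struct.
 */
#define WORKER_OPT_IO		(1u << 0)
#define WORKER_OPT_NO_FILES	(1u << 1)
#define WORKER_OPT_NO_SIGS	(1u << 2)

/* Hypothetical stand-in for an args struct carrying the option bits. */
struct worker_args {
	uint32_t worker_flags;
};

/* Return true if every bit in @opt is set in @args. */
static inline bool worker_opt_set(const struct worker_args *args,
				  uint32_t opt)
{
	return (args->worker_flags & opt) == opt;
}
```

A caller would OR together the options it wants, for example
WORKER_OPT_IO | WORKER_OPT_NO_SIGS, and the common code checks each bit
with worker_opt_set().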