Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

From: Michael Kerrisk (man-pages)
Date: Fri Sep 11 2015 - 04:47:29 EST


On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
> Add documentation.

Hi Andrea,

I do not recall... Did you also write a man page for this new system call?

Thanks,

Michael


> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
> Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 140 insertions(+)
> create mode 100644 Documentation/vm/userfaultfd.txt
>
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..c2f5145
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,140 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow the implementation of on-demand paging from userland
> +and, more generally, they allow userland to take control of various
> +memory page faults, something otherwise only kernel code could do.
> +
> +For example, userfaults allow a cleaner and more efficient
> +implementation of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides two primary functionalities:
> +
> +1) a read/POLLIN protocol to notify a userland thread of the faults
> + as they happen
> +
> +2) various UFFDIO_* ioctls that manage the virtual memory regions
> + registered in the userfaultfd, allowing userland to efficiently
> + resolve the userfaults it receives via 1) or to manage the virtual
> + memory in the background
> +
> +The real advantage of userfaults over regular virtual memory
> +management with mremap/mprotect is that none of their operations
> +involve heavyweight structures like vmas (in fact the userfaultfd
> +runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page- (or hugepage-) granular fault tracking
> +when dealing with virtual address spaces that could span terabytes:
> +too many vmas would be needed for that.
> +
> +The userfaultfd, once opened by invoking the syscall, can also be
> +passed over unix domain sockets to a manager process, so the same
> +manager process can handle the userfaults of a multitude of
> +different processes without them being aware of what is going on
> +(unless they later try to use the userfaultfd themselves on the same
> +region the manager is already tracking, a corner case that currently
> +returns -EBUSY).
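
For readers following along: handing the descriptor to a manager process is ordinary SCM_RIGHTS fd passing, nothing userfaultfd-specific. A minimal sketch, assuming a connected AF_UNIX socket; the helper names send_fd() and recv_fd() are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send descriptor fd over the connected unix socket sock.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of ordinary data carries the ancillary message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* the buffer above always has room for one header */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The tracked process would call send_fd(sock, uffd) once after registering its ranges; the manager gets its own descriptor from recv_fd() and can then poll and resolve faults on behalf of that process.
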
> +
> +== API ==
> +
> +When first opened, the userfaultfd must be enabled by invoking the
> +UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
> +version), which specifies the read/POLLIN protocol userland intends
> +to speak on the UFFD. If successful (i.e. if the requested
> +uffdio_api.api is also spoken by the running kernel), the UFFDIO_API
> +ioctl returns two 64bit bitmasks in uffdio_api.features and
> +uffdio_api.ioctls: respectively, the activated features of the
> +read(2) protocol and the generic ioctls available.
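
To make the handshake concrete, here is roughly how I read it. This is a sketch, assuming the installed headers provide <linux/userfaultfd.h> and SYS_userfaultfd; the wrapper name uffd_open() is hypothetical, and no optional features are requested.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a userfaultfd and enable it with the UFFDIO_API handshake.
 * On success returns the descriptor and fills *api_out with the
 * features and ioctls bitmasks reported by the kernel.
 * Returns -1 if the syscall or the handshake fails. */
int uffd_open(struct uffdio_api *api_out)
{
	struct uffdio_api api;
	int fd;

	assert(api_out != NULL);

	/* There is no libc wrapper assumed here; go through syscall(2). */
	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;	/* protocol version userland wants to speak */
	api.features = 0;	/* ask for no optional features */

	if (ioctl(fd, UFFDIO_API, &api) == -1) {
		close(fd);
		return -1;
	}

	*api_out = api;
	return fd;
}
```

A caller would then test the returned mask before registering anything, e.g. `api.ioctls & (1ULL << _UFFDIO_REGISTER)`. O_NONBLOCK is chosen here so that a later read() can return -EAGAIN instead of blocking.
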
> +
> +Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl
> +should be invoked (if present in the returned uffdio_api.ioctls
> +bitmask) to register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask specifies to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING tracks missing pages). The
> +UFFDIO_REGISTER ioctl returns the uffdio_register.ioctls bitmask of
> +ioctls that are suitable to resolve userfaults on the registered
> +range. Not all ioctls will necessarily be supported for all memory
> +types, depending on the underlying virtual memory backend (anonymous
> +memory vs tmpfs vs real file-backed mappings).
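
As I understand the registration step, it could be sketched as below for the anonymous-memory case. The helper name uffd_map_tracked() is an assumption of mine, and len is assumed to be a multiple of the page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory and register the range for
 * missing-page tracking on uffd. If ioctls_out is not NULL it receives
 * the uffdio_register.ioctls bitmask for the range.
 * Returns the mapping, or NULL on failure. */
void *uffd_map_tracked(int uffd, size_t len, __u64 *ioctls_out)
{
	struct uffdio_register reg;
	void *area;

	assert(len > 0);

	area = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return NULL;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;	/* track missing pages */

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
		munmap(area, len);
		return NULL;
	}

	if (ioctls_out != NULL)
		*ioctls_out = reg.ioctls;
	return area;
}
```

Before relying on a particular resolution ioctl, the caller would check the returned mask, e.g. `ioctls & (1ULL << _UFFDIO_COPY)`.
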
> +
> +Userland can use the uffdio_register.ioctls to manage the virtual
> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means a userfault
> +may trigger just before userland maps the faulting page in the
> +background.
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
> +copies a page into the userfault registered range and wakes up the
> +blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). The other ioctls work similarly to
> +UFFDIO_COPY.
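
A sketch of how UFFDIO_COPY and the DONTWAKE mode might be used together, under the assumption that a caller who suppresses the wakeup issues UFFDIO_WAKE on the range later. The names uffd_copy_page() and uffd_wake() are hypothetical wrappers, not something the patch defines.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Copy len bytes from src into the registered range at dst.
 * With dontwake != 0 the blocked faulting threads are not woken.
 * Returns the number of bytes copied, or -1 if the ioctl fails. */
long long uffd_copy_page(int uffd, void *dst, const void *src,
			 unsigned long len, int dontwake)
{
	struct uffdio_copy copy;

	assert(len > 0);

	memset(&copy, 0, sizeof(copy));
	copy.dst = (unsigned long)dst;
	copy.src = (unsigned long)src;
	copy.len = len;
	copy.mode = dontwake ? UFFDIO_COPY_MODE_DONTWAKE : 0;

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		return -1;
	return copy.copy;
}

/* Wake the threads blocked on faults inside [start, start + len). */
int uffd_wake(int uffd, void *start, unsigned long len)
{
	struct uffdio_range range;

	memset(&range, 0, sizeof(range));
	range.start = (unsigned long)start;
	range.len = len;

	return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

One possible use of DONTWAKE is batching: fill several pages with dontwake set, then call uffd_wake() once over the whole range.
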
> +
> +== QEMU/KVM ==
> +
> +QEMU/KVM is using the userfaultfd syscall to implement postcopy live
> +migration. Postcopy live migration is one form of memory
> +externalization consisting of a virtual machine running with part or
> +all of its memory residing on a different node in the cloud. The
> +userfaultfd abstraction is generic enough that not a single line of
> +KVM kernel code had to be modified in order to add postcopy live
> +migration to QEMU.
> +
> +Guest async page faults, FOLL_NOWAIT and all other GUP features work
> +just fine in combination with userfaults. Userfaults trigger async
> +page faults in the guest scheduler, so guest processes that aren't
> +waiting for userfaults (e.g. network-bound ones) can keep running on
> +the guest vcpus.
> +
> +It is generally beneficial to run one pass of precopy live migration
> +just before starting postcopy live migration, in order to avoid
> +generating userfaults for read-only guest regions.
> +
> +The implementation of postcopy live migration currently uses one
> +single bidirectional socket but in the future two different sockets
> +will be used (to reduce the latency of the userfaults to the minimum
> +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
> +
> +The QEMU in the source node writes all pages that it knows are missing
> +in the destination node into the socket, and the migration thread of
> +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
> +ioctls on the userfaultfd in order to map the received pages into the
> +guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
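
The COPY-versus-ZEROPAGE choice on the destination could look something like the sketch below. Note the assumption: here the receiver detects a zero page by scanning the received buffer, whereas the text only says the choice depends on whether the source page was a zero page, so the real signalling may well differ. Both function names are made up.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/* Return 1 if all len bytes of page are zero, 0 otherwise. */
int page_is_zero(const unsigned char *page, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (page[i] != 0)
			return 0;
	return 1;
}

/* Map one received page at dst in the registered range: zero pages go
 * through UFFDIO_ZEROPAGE, everything else through UFFDIO_COPY.
 * Returns 0 on success, -1 if the ioctl fails. */
int uffd_place_page(int uffd, void *dst, const void *src, size_t len)
{
	assert(len > 0);

	if (page_is_zero(src, len)) {
		struct uffdio_zeropage zp;

		memset(&zp, 0, sizeof(zp));
		zp.range.start = (unsigned long)dst;
		zp.range.len = len;
		zp.mode = 0;	/* wake the faulting threads as usual */
		return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
	}

	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = (unsigned long)dst;
	copy.src = (unsigned long)src;
	copy.len = len;
	copy.mode = 0;
	return ioctl(uffd, UFFDIO_COPY, &copy);
}
```
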
> +
> +A different postcopy thread in the destination node listens with
> +poll() to the userfaultfd in parallel. When a POLLIN event is
> +generated after a userfault triggers, the postcopy thread calls
> +read() on the userfaultfd and receives the fault address (or -EAGAIN
> +in case the userfault was already resolved and woken by a
> +UFFDIO_COPY|ZEROPAGE run by the parallel QEMU migration thread).
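
One iteration of that poll()/read() protocol might be sketched as follows. This assumes the descriptor was opened with O_NONBLOCK (otherwise read() would block rather than return -EAGAIN) and that faults arrive as struct uffd_msg records as declared in <linux/userfaultfd.h>; uffd_next_fault() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for one userfault on uffd.
 * Returns 1 and stores the faulting address in *addr,
 * 0 if nothing is pending (timeout, EINTR, or the fault was already
 * resolved and read() returned EAGAIN), and -1 on error. */
int uffd_next_fault(int uffd, int timeout_ms, unsigned long long *addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;
	int ready;

	assert(addr != NULL);

	ready = poll(&pfd, 1, timeout_ms);
	if (ready < 0)
		return errno == EINTR ? 0 : -1;
	if (ready == 0)
		return 0;			/* timed out */
	if (!(pfd.revents & POLLIN))
		return -1;			/* POLLERR, POLLHUP or POLLNVAL */

	n = read(uffd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg))
		return -1;			/* short read, not a full message */
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;			/* not a page fault notification */

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

The postcopy thread would call this in a loop and forward each returned address to the source over the socket.
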
> +
> +After the QEMU postcopy thread (running in the destination node) gets
> +the userfault address, it writes the information about the missing
> +page into the socket. The QEMU source node receives the information,
> +roughly "seeks" to that page address and continues sending all
> +remaining missing pages from that new page offset. Soon after that
> +(just the time to flush the tcp_wmem queue through the network) the
> +migration thread in the QEMU running in the destination node will
> +receive the page that triggered the userfault and map it as usual
> +with UFFDIO_COPY|ZEROPAGE (without actually knowing whether it was
> +spontaneously sent by the source or was an urgent page requested
> +through a userfault).
> +
> +By the time the userfaults start, the QEMU in the destination node
> +doesn't need to keep any per-page state bitmap relative to the live
> +migration around; a single per-page bitmap has to be maintained in
> +the QEMU running in the source node to know which pages are still
> +missing in the destination node. The bitmap in the source node is
> +checked to find which missing pages to send in round robin, and we
> +seek over it when receiving incoming userfaults. After sending each
> +page the bitmap is updated accordingly. The bitmap also helps avoid
> +sending the same page twice (in case the userfault is read by the
> +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the
> +migration thread).
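
The source-side bookkeeping described here can be modelled in a few lines. This is only an illustration of the round-robin-plus-seek idea, not QEMU's actual data structure; struct missing_map and both functions are names I invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-page bitmap kept by the source: bit i set means page i is still
 * missing on the destination. */
struct missing_map {
	uint64_t *bits;
	size_t npages;
	size_t cursor;	/* next page index the round robin will look at */
};

/* A userfault for page arrived from the destination: continue the
 * round robin from that page. */
void missing_seek(struct missing_map *m, size_t page)
{
	if (page < m->npages)
		m->cursor = page;
}

/* Pick the next missing page at or after the cursor (wrapping around),
 * mark it as sent and advance the cursor past it.
 * Returns the page index, or m->npages when nothing is missing. */
size_t missing_next(struct missing_map *m)
{
	size_t scanned;

	assert(m->bits != NULL || m->npages == 0);

	for (scanned = 0; scanned < m->npages; scanned++) {
		size_t i = (m->cursor + scanned) % m->npages;
		uint64_t mask = (uint64_t)1 << (i % 64);

		if (m->bits[i / 64] & mask) {
			m->bits[i / 64] &= ~mask;	/* never send it twice */
			m->cursor = (i + 1) % m->npages;
			return i;
		}
	}
	return m->npages;
}
```

Because the bit is cleared when the page is picked, a userfault for a page that has already been sent makes missing_seek() move the cursor but does not cause a resend; the scan simply moves on to the next page that is still missing.
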
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/