[PATCH 00/21] RFC: userfaultfd v3

From: Andrea Arcangeli
Date: Thu Mar 05 2015 - 12:23:56 EST


Hello everyone,

This is an RFC for the userfaultfd syscall API v3 that addresses the
feedback received on the previous v2 submission.

The main change from v2 is that MADV_USERFAULT/NOUSERFAULT have
disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
ioctls). In short, userfaults are now only possible through the
userfaultfd. The remap_anon_pages syscall has also disappeared,
replaced by the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted
by the newer UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls, which are more
efficient because they never have to flush the TLB. The suggestion to
copy the data instead of moving it, in order to resolve the userfault,
was immediately agreed upon.

The latest code can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git


Userfaults allow on-demand paging to be implemented from userland and,
more generally, they allow userland to take control of various types
of page faults more efficiently.

For example, userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.
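
For readers who haven't met the PROT_NONE+SIGSEGV trick, here is a
minimal sketch of it, so it's clear what userfaults are meant to
replace. The region is mapped PROT_NONE, and a SIGSEGV handler makes
the faulting page accessible with mprotect() and fills it before the
access is retried. The names lazy_base, lazy_segv and lazy_demo are
made up for this illustration, and the 0x5a fill pattern stands in for
whatever data a real pager would fetch. Calling mprotect() from a
signal handler is an assumption that holds on Linux in practice but
isn't guaranteed by POSIX.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *lazy_base;		/* start of the lazily populated region */
static size_t lazy_len;		/* length of the region in bytes */
static size_t lazy_pagesz;	/* system page size */

/* SIGSEGV handler: unprotect the faulting page and fill it with data. */
static void lazy_segv(int sig, siginfo_t *si, void *ctx)
{
	char *addr = si->si_addr;
	char *page;

	(void)sig;
	(void)ctx;
	if (addr < lazy_base || addr >= lazy_base + lazy_len)
		_exit(2);	/* a genuine crash, not one of our faults */

	page = (char *)((uintptr_t)addr & ~((uintptr_t)lazy_pagesz - 1));
	if (mprotect(page, lazy_pagesz, PROT_READ | PROT_WRITE))
		_exit(3);
	memset(page, 0x5a, lazy_pagesz);	/* placeholder page content */
}

/* Returns the byte read from the second page, or -1 on setup failure. */
int lazy_demo(void)
{
	struct sigaction sa, old;
	int v;

	lazy_pagesz = (size_t)sysconf(_SC_PAGESIZE);
	assert(lazy_pagesz > 0);
	lazy_len = 4 * lazy_pagesz;
	lazy_base = mmap(NULL, lazy_len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (lazy_base == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = lazy_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old)) {
		munmap(lazy_base, lazy_len);
		return -1;
	}

	/* Faults once; the handler fills the page and the load is retried. */
	v = *(volatile char *)(lazy_base + lazy_pagesz);

	sigaction(SIGSEGV, &old, NULL);
	munmap(lazy_base, lazy_len);
	return v;
}
```

Every resolved fault in this scheme costs a signal delivery plus an
mprotect() call on a single page, and the handler is process-wide
state that has to coexist with any other SIGSEGV user.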

There has been interest from multiple users for different use cases:

1) KVM postcopy live migration (one form of cloud memory
externalization). KVM postcopy live migration is the primary driver
of this work:
http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) KVM postcopy live snapshotting (which allows memory usage to be
limited/throttled, unlike fork, and avoids the fork overhead in the
first place).

The syscall API already takes wrprotect fault tracking into account
and it's generic enough to allow its later implementation in a
backwards compatible fashion.

3) KVM userfaults on shared memory. The UFFDIO_COPY low-level method
should be extended to also work on tmpfs, and then
uffdio_register.ioctls will notify userland that UFFDIO_COPY is
available even when the registered virtual memory range is tmpfs
backed.

4) An alternate mechanism to notify web browsers or apps on embedded
devices that volatile pages have been reclaimed. This basically
avoids the need to run a syscall before the app can access the
virtual regions marked volatile with the CPU. This also requires
point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers.

Even though no real use case has requested it yet, the new API also
allows distributed shared memory to be implemented in a way that
readonly shared mappings can exist simultaneously on different hosts
and can become exclusive at the first wrprotect fault.

The UFFDIO_REMAP method is still present in the patchset but it's
provided primarily to remove (not add) memory from the userfault
range. The addition of the UFFDIO_REMAP method is intentionally kept
at the end of the patchset. The postcopy live migration qemu code will
only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
to be merged upstream in the short term, and it can be dropped later
if there's an agreement it's a bad idea to keep it around in the
patchset.
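
To make the UFFDIO_REGISTER, UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls
named above more concrete, here is a minimal userland sketch. It
assumes the ioctl names and struct layouts of <linux/userfaultfd.h> as
found in mainline kernel headers, which may differ in detail from this
RFC. The helper names uffd_open and uffd_fill_demo are hypothetical.
To stay single threaded, it fills the registered range before any
fault happens instead of reacting to fault messages read from the
userfaultfd. It returns -1 when the syscall isn't usable (old kernel,
seccomp filter, or missing privileges).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a userfaultfd and do the UFFDIO_API handshake; -1 if unusable. */
static int uffd_open(void)
{
	struct uffdio_api api;
	int fd = -1;

#ifdef UFFD_USER_MODE_ONLY
	/* Newer kernels may only allow unprivileged use with this flag. */
	fd = syscall(SYS_userfaultfd,
		     O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
#endif
	if (fd < 0)
		fd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	if (ioctl(fd, UFFDIO_API, &api)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* 0 on success, -1 if userfaultfd is unusable, > 0 on any other error. */
int uffd_fill_demo(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	struct uffdio_register reg;
	struct uffdio_copy copy;
	struct uffdio_zeropage zp;
	char *dst, *src;
	int fd, ret = 1;

	assert(psz > 0);
	fd = uffd_open();
	if (fd < 0)
		return -1;

	dst = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED || src == MAP_FAILED)
		goto out;
	memset(src, 0xab, psz);		/* the page content to install */

	/* Arm the range: missing pages are now handled via the uffd. */
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)dst;
	reg.range.len = 2 * psz;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(fd, UFFDIO_REGISTER, &reg))
		goto out;

	/* Install a copy of src as the first page of the range. */
	memset(&copy, 0, sizeof(copy));
	copy.dst = (unsigned long)dst;
	copy.src = (unsigned long)src;
	copy.len = psz;
	if (ioctl(fd, UFFDIO_COPY, &copy))
		goto out;

	/* Map a zero page as the second page of the range. */
	memset(&zp, 0, sizeof(zp));
	zp.range.start = (unsigned long)dst + psz;
	zp.range.len = psz;
	if (ioctl(fd, UFFDIO_ZEROPAGE, &zp))
		goto out;

	/* Both pages are present now, so these reads raise no userfault. */
	ret = ((unsigned char)dst[0] == 0xab && dst[psz] == 0) ? 0 : 2;
out:
	if (dst != MAP_FAILED)
		munmap(dst, 2 * psz);
	if (src != MAP_FAILED)
		munmap(src, psz);
	close(fd);
	return ret;
}
```

A real postcopy receiver would instead read fault messages from the
userfaultfd in a dedicated thread and issue the same two ioctls for
the faulting address once the page content arrives.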

David ran some KVM postcopy live migration benchmarks on an 8-way CPU
system and he measured that using UFFDIO_COPY instead of UFFDIO_REMAP
resulted in roughly a 20% reduction in latency, which is good. The
standard deviation of the latency measurement decreased significantly
as well (because the number of CPUs that required IPI delivery was
variable, while the copy always takes roughly the same time). A bigger
improvement can be expected if measured on a larger host with more
CPUs.

All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
live migration and the UFFD can be passed to a manager process through
unix domain sockets to satisfy point 5).
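
Passing a file descriptor to a manager process over a unix domain
socket is done with an SCM_RIGHTS control message. Here is a minimal
sketch of that mechanism. The function names send_fd, recv_fd and
fd_pass_demo are made up for this illustration, and the demo passes
the write end of a pipe within one process as a stand-in for a UFFD
(the socket code doesn't care what kind of descriptor it carries).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Aligned buffer large enough for one control message carrying an int. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the connected unix socket sock; 0 on success. */
int send_fd(int sock, int fd)
{
	char byte = 'F';	/* at least one byte of payload is required */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *c;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	c = CMSG_FIRSTHDR(&msg);
	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from sock; returns the new fd or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *c;
	int fd = -1;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	c = CMSG_FIRSTHDR(&msg);
	if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(c), sizeof(int));
	return fd;
}

/* Pass a pipe's write end across a socketpair and use the received copy. */
int fd_pass_demo(void)
{
	int sv[2], p[2], got, ok = -1;
	char ch = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;
	if (pipe(p)) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	if (send_fd(sv[0], p[1]) == 0 && (got = recv_fd(sv[1])) >= 0) {
		/* Data written through the received fd shows up in the pipe. */
		if (write(got, "x", 1) == 1 && read(p[0], &ch, 1) == 1 &&
		    ch == 'x')
			ok = 0;
		close(got);
	}
	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

In the container case the sender and receiver would be different
processes, with the receiving manager then servicing the faults on
behalf of the task being migrated.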

I look forward to discussing this further next week at the LSF/MM
summit. If you're attending the summit, see you soon!

Comments welcome, thanks,
Andrea

Credits: partially funded by the Orbit EU project.

PS. There is one TODO detail worth mentioning for completeness that
affects usage 2) and UFFDIO_REMAP if used to remove memory from the
userfault range: handle_userfault() is only effective if
FAULT_FLAG_ALLOW_RETRY is set... but that is only set at the first
attempted page fault. If by accident some thread was already faulting
in the range, the first page fault attempt returned VM_FAULT_RETRY,
and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
before the second attempt starts, a SIGBUS would be raised by the page
fault. Stopping all thread access to the userfault ranges during
UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
real filebacked mappings and handle_userfault() itself, which is
clearly no problem) only tmpfs or a swapin can return
VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
simplest solution would be to allow VM_FAULT_RETRY to still be
returned when FAULT_FLAG_TRIED is set (but only by handle_userfault(),
which has a legitimate reason for insisting a second time in a row
with VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
semantics. Again, userland could cope with this detail but it'd be
inefficient to solve it in userland. This would be a fully backwards
compatible change and it's only strictly required by the wrprotect
tracking mode, so it's no problem to solve this later. Because of its
inherently racy nature, nobody could possibly depend on a racy SIGBUS
being raised now, when it won't be raised anymore later.

Andrea Arcangeli (21):
userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
userfaultfd: linux/Documentation/vm/userfaultfd.txt
userfaultfd: uAPI
userfaultfd: linux/userfaultfd_k.h
userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
userfaultfd: call handle_userfault() for userfaultfd_missing() faults
userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
userfaultfd: prevent khugepaged to merge if userfaultfd is armed
userfaultfd: add new syscall to provide memory externalization
userfaultfd: buildsystem activation
userfaultfd: activate syscall
userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
preparation
userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
userfaultfd: remap_pages: rmap preparation
userfaultfd: remap_pages: swp_entry_swapcount() preparation
userfaultfd: UFFDIO_REMAP uABI
userfaultfd: remap_pages: UFFDIO_REMAP preparation
userfaultfd: UFFDIO_REMAP
userfaultfd: add userfaultfd_wp mm helpers

Documentation/ioctl/ioctl-number.txt | 1 +
Documentation/vm/userfaultfd.txt | 97 +++
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/Makefile | 1 +
fs/userfaultfd.c | 1128 ++++++++++++++++++++++++++++++++
include/linux/mm.h | 4 +-
include/linux/mm_types.h | 11 +
include/linux/swap.h | 6 +
include/linux/syscalls.h | 1 +
include/linux/userfaultfd_k.h | 112 ++++
include/linux/wait.h | 5 +-
include/uapi/linux/userfaultfd.h | 150 +++++
init/Kconfig | 11 +
kernel/fork.c | 3 +-
kernel/sched/wait.c | 7 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 1 +
mm/huge_memory.c | 217 +++++-
mm/madvise.c | 3 +-
mm/memory.c | 16 +
mm/mempolicy.c | 4 +-
mm/mlock.c | 3 +-
mm/mmap.c | 39 +-
mm/mprotect.c | 3 +-
mm/rmap.c | 9 +
mm/swapfile.c | 13 +
mm/userfaultfd.c | 793 ++++++++++++++++++++++
net/sunrpc/sched.c | 2 +-
32 files changed, 2593 insertions(+), 54 deletions(-)
create mode 100644 Documentation/vm/userfaultfd.txt
create mode 100644 fs/userfaultfd.c
create mode 100644 include/linux/userfaultfd_k.h
create mode 100644 include/uapi/linux/userfaultfd.h
create mode 100644 mm/userfaultfd.c
