Re: [PATCH 00/10] RFC: userfault

From: Andy Lutomirski
Date: Wed Jul 02 2014 - 21:51:21 EST


On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Hello everyone,
>
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

>
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
>
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
>
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
>
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
>
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
>
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
>
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
>
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
>
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
>
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
>
> The code can be found here:
>
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault
>
> The branch is rebased so you can get updates for example with:
>
> git fetch && git checkout -f origin/userfault
>
> Comments welcome, thanks!
> Andrea
>
> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
> include/syscalls-x86_64.h | 2 +
> syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
> syscalls/syscalls.h | 2 +
> syscalls/userfaultfd.c | 12 ++++++
> 4 files changed, 116 insertions(+)
> create mode 100644 syscalls/remap_anon_pages.c
> create mode 100644 syscalls/userfaultfd.c
>
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
> { .entry = &syscall_sched_setattr },
> { .entry = &syscall_sched_getattr },
> { .entry = &syscall_renameat2 },
> + { .entry = &syscall_remap_anon_pages },
> + { .entry = &syscall_userfaultfd },
> };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> + unsigned long, dst_start, unsigned long, src_start,
> + unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> + 1 * MB, 2 * MB, 4 * MB, 8 * MB,
> + 10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> + unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> + unsigned long max_rand;
> + if (rand_bool()) {
> + g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> + } else
> + g_src = MAP_FAILED;
> + if (rand_bool()) {
> + g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> + } else
> + g_dst = MAP_FAILED;
> + g_size = size;
> + g_check = 1;
> +
> + rec->a1 = (unsigned long) g_dst;
> + rec->a2 = (unsigned long) g_src;
> + rec->a3 = g_size;
> + rec->a4 = 0;
> +
> + if (rand_bool())
> + max_rand = -1UL;
> + else
> + max_rand = g_size << 1;
> + if (rand_bool()) {
> + rec->a3 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + rec->a1 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + rec->a2 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + if (rand_bool()) {
> + rec->a4 = rand();
> + } else
> + rec->a4 = RAP_ALLOW_SRC_HOLES;
> + }
> + if (g_src != MAP_FAILED)
> + memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> + if (g_check && !rec->retval) {
> + unsigned long size = g_size;
> + unsigned char *dst = g_dst;
> + while (size--)
> + assert(dst[size] == 0xaaU);
> + }
> + munmap(g_src, g_size);
> + munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> + .name = "remap_anon_pages",
> + .num_args = 4,
> + .arg1name = "dst_start",
> + .arg2name = "src_start",
> + .arg3name = "len",
> + .arg4name = "flags",
> + .group = GROUP_VM,
> + .sanitise = sanitise_remap_anon_pages,
> + .post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
> extern struct syscallentry syscall_sched_getattr;
> extern struct syscallentry syscall_renameat2;
> extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> + .name = "userfaultfd",
> + .num_args = 1,
> + .arg1name = "flags",
> + .arg1type = ARG_LEN,
> + .rettype = RET_FD,
> +};
>
>
> Andrea Arcangeli (10):
> mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
> mm: madvise MADV_USERFAULT
> mm: PT lock: export double_pt_lock/unlock
> mm: rmap preparation for remap_anon_pages
> mm: swp_entry_swapcount
> mm: sys_remap_anon_pages
> waitqueue: add nr wake parameter to __wake_up_locked_key
> userfaultfd: add new syscall to provide memory externalization
> userfaultfd: make userfaultfd_write non blocking
> userfaultfd: use VM_FAULT_RETRY in handle_userfault()
>
> arch/alpha/include/uapi/asm/mman.h | 3 +
> arch/mips/include/uapi/asm/mman.h | 3 +
> arch/parisc/include/uapi/asm/mman.h | 3 +
> arch/x86/syscalls/syscall_32.tbl | 2 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/xtensa/include/uapi/asm/mman.h | 3 +
> fs/Makefile | 1 +
> fs/proc/task_mmu.c | 5 +-
> fs/userfaultfd.c | 593 +++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 11 +-
> include/linux/ksm.h | 4 +-
> include/linux/mm.h | 5 +
> include/linux/mm_types.h | 2 +-
> include/linux/swap.h | 6 +
> include/linux/syscalls.h | 5 +
> include/linux/userfaultfd.h | 42 +++
> include/linux/wait.h | 5 +-
> include/uapi/asm-generic/mman-common.h | 3 +
> init/Kconfig | 10 +
> kernel/sched/wait.c | 7 +-
> kernel/sys_ni.c | 2 +
> mm/fremap.c | 506 ++++++++++++++++++++++++++++
> mm/huge_memory.c | 209 ++++++++++--
> mm/ksm.c | 2 +-
> mm/madvise.c | 19 +-
> mm/memory.c | 14 +
> mm/mremap.c | 2 +-
> mm/rmap.c | 9 +
> mm/swapfile.c | 13 +
> net/sunrpc/sched.c | 2 +-
> 30 files changed, 1447 insertions(+), 46 deletions(-)
> create mode 100644 fs/userfaultfd.c
> create mode 100644 include/linux/userfaultfd.h
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/