Re: [PATCH 00/17] RFC: userfault v2

From: Andrea Arcangeli
Date: Thu Nov 20 2014 - 12:30:34 EST

Next message: Josh Poimboeuf: "Re: [PATCHv2 2/3] kernel: add support for live patching"
Previous message: SÃren Brinkmann: "Re: [PATCH 1/1] net: Xilinx: Deletion of unnecessary checks before two function calls"
In reply to: zhanghailiang: "Re: [PATCH 00/17] RFC: userfault v2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On Fri, Oct 31, 2014 at 12:39:32PM -0700, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> > Agreed, but for doing live memory snapshot (VM is running when do snapsphot),
> > we have to do this (block the write action), because we have to save the page before it
> > is dirtied by writing action. This is the difference, compared to pre-copy migration.
>
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!

It sounds really similar issue as live migration, one can implement a
precopy live snapshot, or a precopy+postcopy live snapshot or a pure
postcopy live snapshot.

The decision on the amount of precopy done before engaging postcopy
(zero passes, 1 pass, or more passes) would have similar tradeoffs
too, except instead of having to re-transmit the re-dirtied pages over
the wire, it would need to overwrite them to disk.

The more precopy passes, the longer it takes for the live snapshotting
process to finish and the more I/O there will be (for live migration it'd
be network bandwidth usage instead of amount of I/O), but the shortest
the postcopy runtime will be (and the shorter postcopy runtime is, the
fewer userfaults will end up triggering on writes, in turn reducing
the slowdown and the artificial fault latency introduced to the guest
runtime). But the more precopy passes the more overwriting will happen
during the "longer" precopy stage and the more overall load there will
be for the host (the otherwise idle part of the host).

For the postcopy live snapshot the wrprotect faults are quite
equivalent to the not-present faults of postcopy live migration logic.

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page.
>
> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the
> parent process races ahead and writes to all of the pages before you finish the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.

This is a very good point. fork must be evaluated first because it
literally already provides you a readonly memory snapshot of the guest
memory.

The memory cons mentioned above could lead to both -ENOMEM of too many
guests runs live snapshots at the same time in the same host, unless
overcommit_memory is set to 1 (0 by default). Even then if too many
live snapshots are running in parallel you could hit the OOM killer if
there are just a bit too many faults at the same time, or you could
hit heavy swapping which isn't ideal either.

In fact the -ENOMEM avoidance (with qemu failing) is one of the two
critical reasons why qemu always set the guest memory as
MADV_DONTFORK. But that's not the only reason.

To use the fork() trick you'd need to undo the MADV_DONTFORK first but
that would open another problem: there's a race condition between
fork() O_DIRECT and <4k hardblocksize of virtio-blk. If there's any
read() syscall with O_DIRECT with len=512 while fork() is running
(think if the aio runs in parallel with the live snapshot thread that
forks the child to dump the snapshot) and if the guest writes with the
CPU to any 512 fragment of the same page that is the destination
buffer of the write(len=512) (on two different 512bytes area of the
same guest page) the O_DIRECT write will get lost.

So to use fork we'd need to fix this longstanding race (I tried but in
the end we declared it an userland issue because it's not exploitable
to bypass permissions or corrupt kernel or unrelated memory). Or you'd
need to add locking between the dataplane/aio threads and the live
snapshot thread to ensure no direct-io I/O is ever in-flight while
fork runs.

The O_DIRECT however would only help if it's qemu TCG, if it's KVM
it's not even enough to stop O_DIRECT reads. KVM would use
gup(write=1) from the async-pf all the time... and then the shadow
pagetables would go out of sync (it won't destabilize the host of
course, but the guest memory would be corrupt then and guest would
misbehave). In short all vcpu would need to be halted too in addition
to all direct-I/O. Possibly those gets stopped anyway before starting
the snapshot (they certainly are stopped before starting postcopy
live migration :).

Even if it'd be possible to serialize things in qemu to prevent the
race, unless we first fix fork vs o_direct race in the host kernel, I
wouldn't feel safe in removing MADV_DONTFORK and depend on fork for
the snapshotting. This is also because fork may still be used by qemu
in pci hotplug (to fork+exec but it cannot vfork because it has to
alter the signal handlers first). fork() is something people may do
without thinking it'll automatically trigger memory corruption in the
parent.

(If we'd use fork instead of userfaultfd for this, it'd also be nice
to add a madvise for THP, that will alter the COW faults on THP pages,
to copy only 4k and split the pmd into 512 ptes by default leaving 511
not-cowed pte readonly. The split_huge_page design change that adds a
failure path to split_huge_page would prevent having to split the
trans_huge_pmd on the child too, so it would be more efficient and
strightforward change than it would be if we'd add such a new madvise
right now. Redis would then use that new madvise too instead of
MADV_NOHUGEPAGE, as it uses fork for something similar as the above
live snapshot already and it creates random access cows that with THP
are copying and allocating more memory than with 4k pages. And
hopefully it's already getting the O_DIRECT race right if it happens
uses O_DIRECT + threads too.)

wrprotect userfaults would eliminate the need for fork, and they would
limit the maximal amount of memory allocated by the live snapshot to
the maximal number of asynchronous page faults multiplied by the
number of vcpus multiplied by the page size, and you can increase it
further by creating some buffering but you can still throttle it fine
to keep the buffer limited in size (not like with fork where the
buffer is potentially as large as the entire virtual machine
size). Furthermore you could in theory split the THP and map writable
only the 4k you copy during the wrprotect userfault (COW would always
cow the entire 2m increasing the latency of the fault in the parent).

Problem is, there are no kernel syscalls that allows you to mark all
trans_huge_pmds and ptes wrprotected without altering the vma too, and
we cannot mangle over vmas for doing these things as we could end up
with too many vmas and a -ENOMEM failure (not to tell the inefficiency
of such a load). So at the moment just adding the wrprotect faults to
the userfaultfd protocol isn't going to move the needle until you also
add new commands or syscalls to mangle pte/pmd wrprotect bits.

The real things to decide in API terms if those new operations (that
would likely be implemented in fremap.c) should be exposed to
userland as standalone syscalls that works similar to the
mremap/mprotect but never actually touch any vma and only hold the
mmap_sem for reading. Or if they should be embedded in the userfaultfd
wire protocol as additional commands to write into the ufd.

You'd need one syscall to mark all guest memory readonly. And then the
same syscall could be invoked on the 4k region that triggered
wrprotect-userfault to mark it writable again, just before writing the
same 4k region into the ufd to wakeup and retry the page fault.

The advantage of embedding all pte/pmd mangling inside special ufd
commands is that sometime the ufd write needed to wakeup the page
fault could be skipped (i.e. we could resolve the userfault with 1
syscall instead of 2). The downside is that it forces to change the
userfault protocol every time you want to add a new command. While if
we keep the ufd purely as a page fault event notification/wakeup
mechanism without allowing it to change the pte/pmds (and we leave the
task of mangling the pte/pmds to new syscalls), we could more easily
create a long lived userfault protocol that provides all features now
and we could extend the syscalls to mangle the address space with more
flexibility later.

Also the more commands that are only available through userfaultfd,
the more the MADV_USERFAULT for SIGBUS usage becomes less interesting
as SIGBUS would only provide a reduced set of features that cannot be
available without dealing with an userfaultfd. For example a wrprotect
fault (as result of the task forking) if the userfaultfd is closed
would then need to SIGBUS or not if only MADV_USERFAULT is set?

Until we added the wrprotect faults into the equation,
MADV_USERFAULT+SIGBUS was functionally equivalent (just less efficient
for multithreaded programs and uncapable of dealing with kernel
access). Once we add wrprotect faults I'm uncertain it is worth to
retain MADV_USERFAULT and SIGBUS.

The fast path branch that userfaultfd requires in the page fault does:

/* Deliver the page fault to userland, check inside PT lock */
if (vma->vm_flags & VM_USERFAULT) {
pte_unmap_unlock(page_table, ptl);
return handle_userfault(vma, address, flags);
}

It'd be trivial to change it to:

if (vma->vm_userfault_ctx != NULL_VM_USERFAULTFD_CTX) {
pte_unmap_unlock(page_table, ptl);
return handle_userfault(vma, address, flags);
}

(and the whole block would still be optimized at build time with
CONFIG_USERFAULT=n for embedded without virt needs)

In short we need to decide 1) if to retain the MADV_USERFAULT+SIGBUS
behavior, and 2) if to expose the new commands needed to flip the
wrprotect bits without altering the vmas and to copy the pages
atomically as standalone syscalls or as new commands of the
userfaultfd wire protocol.

About the question if I intend to add wrprotect faults, the answer is
that I certainly do. I think it is good idea regardless of the live
snapshotting usage, because it also allows to more efficiently
implement distributed share memory too (allowing to map the memory
readonly if shared and to make it exclusive again on write access) if
anybody dares.

In theory the userfaultfd could also pre-cow the page and return you
the page through the read() syscall if we add a new protocol later to
accellerate it. But I think the current protocol shouldn't go that far
and we should aim for a protocol that is usable by all even if some
more operation will have to happen in userland than in the
accelerated version specific for live snapshotting (the in-kernel cow
wouldn't necessarily provide a significant speedup anyway).

Supporting only wrprotect faults should be fine by just defining two
bits during registration, one for not-present and one for wrprotect
faults then it's up to you if you set one of the two or both. (at
least one has to be set)

> I absolutely plan on releasing these patches :-) CRIU was the first open-source
> userland I had planned on integrating with. At Google, I'm working with our
> home-grown Qemu replacement. However, I'd be happy to help with an effort to
> get softdirty integrated in Qemu in the future.

Improving precopy by removing the software driven log buffer sounds
like a great to precopy. But I tend to agree with Zhang that it's
orthogonal with the actual postcopy stage that requires to block the
page fault not just track it later (for both live migration and live
snapshot). precopy is not getting obsoleted by postcopy, they just
work in tandem and that applies especially to live migration but I see
no reason why the same applies to live snapshotting like said above.

Comments welcome, thanks!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Josh Poimboeuf: "Re: [PATCHv2 2/3] kernel: add support for live patching"
Previous message: SÃren Brinkmann: "Re: [PATCH 1/1] net: Xilinx: Deletion of unnecessary checks before two function calls"
In reply to: zhanghailiang: "Re: [PATCH 00/17] RFC: userfault v2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]