Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)

From: Andrea Arcangeli
Date: Wed Jan 30 2019 - 09:43:11 EST


Hello Mike,

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users these concerns would be
> relevant to them as well.
>
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
>
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
>
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.

That's a valid concern indeed.
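
For reference, the pagemap side of the soft-dirty scan you describe
can be sketched roughly as below. This is only a minimal illustration
and not CRIU's code: it inspects the calling process through
/proc/self rather than a frozen target, and the helper names
(pagemap_offset, pagemap_entry_is_soft_dirty, page_is_soft_dirty,
clear_soft_dirty) are made up for the example. It relies on the
documented pagemap layout (one 64-bit entry per virtual page, bit 55 =
soft-dirty) and on writing "4" to clear_refs to reset the bits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag. */
#define PM_SOFT_DIRTY (1ULL << 55)

/* Byte offset in the pagemap file of the entry covering vaddr. */
static uint64_t pagemap_offset(uintptr_t vaddr, long page_size)
{
    assert(page_size > 0);
    return (uint64_t)(vaddr / (uintptr_t)page_size) * sizeof(uint64_t);
}

/* 1 if the entry has the soft-dirty bit set, 0 otherwise. */
static int pagemap_entry_is_soft_dirty(uint64_t entry)
{
    return (entry & PM_SOFT_DIRTY) ? 1 : 0;
}

/* Start a new tracking interval: clear soft-dirty on all our PTEs. */
static int clear_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* 1 if the page at addr was written since the last clear,
 * 0 if not, -1 on error. */
static int page_is_soft_dirty(const void *addr)
{
    uint64_t entry;
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    off_t off = (off_t)pagemap_offset((uintptr_t)addr, page_size);
    ssize_t n = pread(fd, &entry, sizeof(entry), off);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return pagemap_entry_is_soft_dirty(entry);
}
```

The point to notice is that the scan has to look at one entry per
virtual page of every mapping, whether or not the page was written.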

I didn't go into the details of what additional feature is needed on
top of what is already present in Peter's current patchset, but
you're correct that in order to perform well as a softdirty
equivalent, we'll also need to add an async event model.

The async event model would be set during UFFD registration. It'd work
like async signals: you just queue up uffd events in the kernel by
allocating them as slab objects (not in the kernel stack of the
faulting process). Only if the monitor doesn't read() them fast enough
would the write protect fault eventually block and release the
mmap_sem, but the page fault would always be resolved by the kernel
even in that case. For the monitor there'll be just a stream of
uffd_msg structures to read in multiples of the uffd_msg structure
size with a single syscall per wakeup of the monitor. Conceptually
it'd work the same as how PML works for EPT.
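
On the monitor side, the "read in multiples of the uffd_msg structure
size" part could look roughly like the sketch below. The async mode
itself is only a proposal here, so nothing in this sketch is an
existing async API: it simply reads whole struct uffd_msg records from
a file descriptor in one read() and hands each one to a callback.
drain_uffd_events, log_wp_fault, struct dirty_log and UFFD_BATCH are
hypothetical names; struct uffd_msg, UFFD_EVENT_PAGEFAULT and
UFFD_PAGEFAULT_FLAG_WP come from <linux/userfaultfd.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define UFFD_BATCH 64   /* hypothetical: max messages per read() */

/* Hypothetical dirty log filled by the monitor. */
struct dirty_log {
    unsigned long long addrs[UFFD_BATCH];
    size_t count;
};

typedef void (*uffd_msg_handler)(const struct uffd_msg *msg, void *ctx);

/* Example handler: record the address of each write-protect fault. */
static void log_wp_fault(const struct uffd_msg *msg, void *ctx)
{
    struct dirty_log *log = ctx;

    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return;
    if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
        return;
    if (log->count < UFFD_BATCH)
        log->addrs[log->count++] = msg->arg.pagefault.address;
}

/* One syscall per wakeup: read up to UFFD_BATCH messages at once and
 * dispatch them. Returns the number of messages, or -1 on error. */
static ssize_t drain_uffd_events(int fd, uffd_msg_handler handle, void *ctx)
{
    struct uffd_msg batch[UFFD_BATCH];
    ssize_t n;

    do {
        n = read(fd, batch, sizeof(batch));
    } while (n < 0 && errno == EINTR);
    if (n < 0)
        return -1;

    /* The stream is expected to carry whole messages only. */
    if ((size_t)n % sizeof(batch[0]) != 0) {
        errno = EIO;
        return -1;
    }

    size_t count = (size_t)n / sizeof(batch[0]);
    for (size_t i = 0; i < count; i++)
        handle(&batch[i], ctx);
    return (ssize_t)count;
}
```

The monitor would call drain_uffd_events() each time poll() reports
the descriptor readable, so a burst of write faults costs one wakeup
and one read() rather than one round trip per fault.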

The main downside will be an allocation per fault (soft dirty doesn't
need to do such allocation), but there will be no round-trip to
userland latency added to the wrprotect fault that needs to be logged.

We need the synchronous/blocking uffd-wp for other things that aren't
related to soft dirty and can't be achieved with an async model like
softdirty. Adding an async model later would be a self contained
feature inside uffd.

So the idea would be to ignore any comparison with softdirty until
uffd-wp is finalized, and then evaluate the possibility of adding an
async model, which would be a simple thing to add in comparison to the
uffd-wp feature itself.

The theoretical expectation would be that softdirty would perform
better for small processes (but for those the overall logging overhead
is small anyway), but when it gets to the hundred-gigabytes/terabytes
regions, async uffd-wp should perform much better.
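
A back-of-the-envelope way to see why, under my own simplifying
assumptions (base pages only, one event per dirtied page, nothing
measured): the softdirty scan reads one 8-byte pagemap entry per
virtual page of the region, while an async event stream would only
carry one message per page actually written. The function names and
the msg_size parameter below are made up for the illustration.

```c
#include <stdint.h>

/* Bytes of /proc/pid/pagemap that one full scan of the region reads:
 * one 8-byte entry per virtual page. */
static uint64_t pagemap_scan_bytes(uint64_t region_bytes, uint64_t page_size)
{
    return region_bytes / page_size * 8;
}

/* Bytes the monitor would read from an event stream if each dirtied
 * page produced exactly one message of msg_size bytes (assumption). */
static uint64_t event_stream_bytes(uint64_t dirty_pages, uint64_t msg_size)
{
    return dirty_pages * msg_size;
}
```

For a 1 TiB region with 4 KiB pages, pagemap_scan_bytes() gives 2 GiB
of pagemap to walk on every pre-dump iteration regardless of how
little was written, whereas the event stream grows only with the
number of dirtied pages.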

Thanks,
Andrea