[LSF/MM TOPIC] userfaultfd

From: Andrea Arcangeli
Date: Wed Jan 14 2015 - 18:01:42 EST


Hello,

I would like to attend this year's (2015) LSF/MM summit. I'm
particularly interested in the MM track, in order to get help in
finalizing the userfaultfd feature I've been working on lately.

An overview on the userfaultfd feature can be read here:

http://lwn.net/Articles/615086/

In essence the userfault feature can be thought of as an optimal
implementation of userland-driven on-demand paging, similar to what
can be achieved today with PROT_NONE+SIGSEGV.
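
For reference, the PROT_NONE+SIGSEGV scheme looks roughly like the
illustrative sketch below (not part of the proposal); note how
resolving every fault requires an mprotect() that can split the vma:

/* Illustrative sketch of the PROT_NONE+SIGSEGV scheme: every resolved
 * fault needs an mprotect() that may split the vma, which is exactly
 * the scalability problem userfaultfd avoids. */
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void segv_handler(int sig, siginfo_t *si, void *uc)
{
	char *page = (char *)((unsigned long)si->si_addr &
			      ~(page_size - 1));

	/* Resolve the fault: make the page accessible and fill it in.
	 * Each mprotect() here can split the vma in two or three. */
	mprotect(page, page_size, PROT_READ | PROT_WRITE);
	memset(page, 0xaa, page_size);	/* stand-in for the real data */
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = segv_handler,
				.sa_flags = SA_SIGINFO };
	char *region;

	page_size = sysconf(_SC_PAGE_SIZE);
	sigaction(SIGSEGV, &sa, NULL);

	region = mmap(NULL, 16 * page_size, PROT_NONE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return region[3 * page_size];	/* faults; handler resolves it */
}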

userfaultfd fundamentally allows memory to be managed at the pagetable
level, by delivering page fault notifications to userland so it can
handle them with userfaultfd commands that mangle the address space,
without involving heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing, just as
its kernel counterpart wouldn't). The number of vmas is limited too,
so they're not suitable when there are too many scattered faults and
the address space is unbounded. userfaultfd allows all userfaults to
happen in parallel from different threads, and it relies on userland
to resolve the userfaults with atomic copy or move commands.
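
To make the protocol concrete, a monitor thread could look roughly
like this (illustrative sketch: the command names are still subject
to change, error handling is omitted and a 4k page size is assumed):

/* Sketch of a monitor thread speaking the userfaultfd protocol: read
 * fault events from the fd and resolve each one with an atomic copy,
 * never taking the mmap_sem for writing. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL	/* assumption: 4k pages */

static void uffd_monitor(void *area, size_t len)
{
	static char page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};

	/* Handshake on the protocol version, then register the range
	 * whose not-present faults we want delivered to userland. */
	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	for (;;) {
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		struct uffd_msg msg;
		struct uffdio_copy copy;

		poll(&pfd, 1, -1);
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault with an atomic copy into the
		 * faulting address; the blocked thread is woken up. */
		memset(page, 0xaa, PAGE_SIZE);	/* stand-in for real data */
		copy = (struct uffdio_copy) {
			.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1),
			.src = (unsigned long)page,
			.len = PAGE_SIZE,
		};
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
}

The faulting threads just touch the memory and block until the atomic
copy resolves the fault: no signal handlers and no vma changes are
involved.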

By adding richer commands to the userfaultfd protocol (spoken over the
fd, like the basic atomic copy command that is needed to resolve the
userfault), in the future we can also mark regions readonly and trap
only wrprotect faults (or both wrprotect and not-present faults
simultaneously).
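
As a purely hypothetical taste of what such a command could look like
(none of these names are settled), a helper flipping write protection
on a range registered for wrprotect tracking might be:

/* Hypothetical sketch of the wrprotect extension described above:
 * flip write protection on a range at the pagetable level, without
 * splitting or merging any vma.  The range would have to be
 * registered for wrprotect tracking first. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int uffd_wrprotect(int uffd, unsigned long start,
			  unsigned long len, int protect)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = start, .len = len },
		.mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
	};

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}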

Different userfaultfds can already be used independently by multiple
libraries and by the main application within the same process.

Once opened, the userfaultfd can also be passed over unix domain
sockets to a manager process (use case 5 below wants to do this), so
the same manager process could handle the userfaults of a multitude of
different processes without their being aware of what is going on
(well, of course, unless they later try to use a userfaultfd
themselves on the same region the manager is already tracking, a
corner case whose relevance should be discussed).
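
Passing the fd around is standard SCM_RIGHTS file descriptor passing
over a unix domain socket, nothing userfaultfd specific; roughly:

/* Hand the open userfaultfd to a manager process over a unix domain
 * socket: plain SCM_RIGHTS fd passing.  Error handling omitted. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_uffd(int unix_sock, int uffd)
{
	char dummy = 'u';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));

	return sendmsg(unix_sock, &msg, 0);
}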

There was interest from multiple users; I hope I'm not forgetting any:

1) KVM postcopy live migration (one form of cloud memory
externalization). KVM postcopy live migration is the primary driver
of this work:
http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/

2) KVM postcopy live snapshotting (allowing memory usage to be
limited/throttled, unlike what fork would allow).

3) KVM userfaults on shared memory (currently only anonymous memory
is handled by the userfaultfd, but nothing prevents extending it to
allow registering a tmpfs region with the userfaultfd and firing a
userfault if the tmpfs page is not present).

4) an alternate mechanism to notify web browsers or apps on embedded
devices that volatile pages have been reclaimed. This basically
avoids the need to run a syscall before the app can access the
virtual regions marked volatile with the CPU. This also requires
point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers
(provided there is a userfaultfd command [not an external syscall
like in the original implementation] that allows copying memory
atomically into the userfaultfd "mm" and not into the manager "mm";
this is the main reason the external syscalls are going away, and in
turn the fd-less MADV_USERFAULT is going away as well).

6) qemu linux-user binary emulation was also briefly interested in
the wrprotect fault notification for non-x86 archs. In this
context the userfaultfd might (not sure) be useful to JIT
emulation, to efficiently protect the translated regions by only
wrprotecting the pagetables without having to split or merge vmas
(the risk of running out of vmas isn't there for this use case, as
the translation cache is probably limited in size and not heavily
scattered).

7) distributed shared memory that could allow simultaneous mapping of
regions marked readonly and collapse them on the first exclusive
write. I'm mentioning it as a corollary, because I'm not aware of
anybody who is planning to use it that way (still, I'd like this to
be possible too, just in case it finds its use later on).

The currently planned API (as hinted above) is already different from
the first version of the code posted a couple of months ago, thanks to
the valuable feedback received from the community so far.

As usual suggestions will be welcome, thanks!
Andrea