Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am not so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
There is some complexity, indeed. Worse, there are some quirks of UFFD
that make life hard for no reason, as well as some uffd and io_uring bugs.
As for my sales pitch - I agree that I am not the best car dealer… :(
When I say performance is not bad, I mean that the core operations of
page-fault handling, prefetch and reclaim do not induce high overhead
*after* the improvements I sent or mentioned.
The benefit of doing so from userspace is that you have full control
over the reclaim/prefetch policies, so you may be able to make better
Some workloads have predictable access patterns (see for instance "MAGE:
Nearly Zero-Cost Virtual Memory for Secure Computation”, OSDI’21). You may
be able to handle such access patterns without requiring intrusive changes to the
I am aware that there are some caveats, as zapping the memory does not
guarantee that the memory would be freed since it might be pinned for a
variety of reasons. That's the reason I mentioned the processes have "some
level of cooperation" with the manager. It is not intended to deal with
adversaries or uncommon corner cases (e.g., processes that use UFFD for
their own reasons).
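
To make "zapping" concrete, here is a minimal sketch of the local case
only: madvise(MADV_DONTNEED) on a private anonymous page of the calling
process. The helper name zap_and_reread() is made up for illustration.
The remote variant through process_madvise() is what this thread
proposes and is not shown, and the caveats above (pinned or shared
pages) are not exercised here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): zap one private anonymous
 * page of the calling process and report what a later read sees.
 * Returns the first byte read after the zap, or -1 on error.
 */
static int zap_and_reread(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);               /* populate the page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The next access faults in a fresh zero-filled page. */
    int first = p[0];

    munmap(p, len);
    return first;
}
```

For a private anonymous mapping the call is expected to return 0, i.e.
the old contents are gone and the next touch takes a page fault, which
is the event a manager would want to intercept.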
It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
I know it’s not just long-term pinnings. That’s what “variety of reasons”
stood for. ;-)
Aging is a tool for certain types of reclamation policies. Some do not
require it (e.g., random). You can also have compiler/application-guided
reclamation policies. If you are really into “aging”, you may be able
to use PEBS or other CPU facilities to track it.
Anyhow, the access-bit by itself is not such a great solution to track
aging. Setting it can induce overheads of >500 cycles from my (and
Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
Hints, as those that Michal mentioned, prevent the efficient use of
userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
when the page is brought back from swap. So using
MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
prefetch policy, for instance. It would also require you to live
with the kernel reclamation/IO stack for better or worse.
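
As a rough illustration of the hint-based path, the sketch below applies
MADV_PAGEOUT and then MADV_WILLNEED to a page of the calling process and
checks that the data is still there afterwards. The helper name
pageout_keeps_contents() is hypothetical, the fallback value for
MADV_PAGEOUT is an assumption for older headers, and both hints are
treated as best-effort (without swap, or on an older kernel, they may do
nothing or fail). No userfaultfd is registered here; the sketch only
shows that the kernel brings the contents back by itself, so userspace
gets no hook for its own prefetch policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21   /* assumed generic Linux value for old headers */
#endif

/*
 * Hypothetical helper (not from the thread): hint the kernel to page
 * out and then prefetch one anonymous page, then verify the contents.
 * Returns 1 if the data is intact, 0 if not, -1 if mmap() failed.
 */
static int pageout_keeps_contents(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);

    /* Both calls are hints; their results are deliberately ignored. */
    (void)madvise(p, len, MADV_PAGEOUT);
    (void)madvise(p, len, MADV_WILLNEED);

    /* Any swap-in on these reads is handled entirely by the kernel. */
    int intact = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0x5A) {
            intact = 0;
            break;
        }
    }

    munmap(p, len);
    return intact;
}
```

Contrast this with MADV_DONTNEED, where the old contents are dropped and
whoever handles the next fault decides what to put in the page.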
As for DAMON, I am not very familiar with it, but from what I remember
it seemed to look in a similar direction. IMHO it is more intrusive
and less configurable (although it can have the advantage of better
integration with various kernel mechanisms). I was wondering for a
second why you give me such a hard time for a pretty straight-forward
extension for process_madvise(), but then I remembered that DAMON got
into the kernel after >30 versions, so I’ll shut up about that. ;-)
Putting aside my use-case (which I am sure people would be glad to criticize),
I can imagine debuggers or emulators may also find use for similar schemes
(although I do not have concrete use-cases for them).
I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time seeing why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
I have some more vague ideas, but I am afraid that you will keep
saying that it makes more sense to handle such events from within
a process. I am not sure that this is true. Even for the emulators
that we discuss, the emulated program might run in a different
address space (for sandboxing). You may be able to avoid the need
for remote-UFFD and get away with the current non-cooperative
UFFD, but zapping the memory (for atomic updates) would still
require process_madvise(MADV_DONTNEED) [putting aside various
Anyhow, David, I really appreciate your feedback. And you make
strong points about issues I encounter. Yet, eventually, I think
that the main question in this discussion is whether enabling
process_madvise(MADV_DONTNEED) is any different - from a security
point of view - than process_vm_writev(), not to mention ptrace.
If not, then the same security guards should suffice, I would