Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging

From: David Hildenbrand
Date: Mon Mar 03 2025 - 15:50:18 EST


On 03.03.25 21:01, Mathieu Desnoyers wrote:
On 2025-02-28 17:32, Peter Xu wrote:
On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
On 2025-02-28 11:32, Peter Xu wrote:
On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
For the VM use-case, I wonder if we could just add a userfaultfd
"COW" event that would notify userspace when a COW happens ?

I don't know what's best for KSM or how well this will work, but we
have had such an event for years. See UFFDIO_REGISTER_MODE_WP:

https://man7.org/linux/man-pages/man2/userfaultfd.2.html

userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
resulting from a mmap mapping, but returns EINVAL if I pass a
page-aligned address which sits within a private file mapping
(e.g. executable data).

Yes, so far sync traps only support RAM-based file systems or anonymous
memory. Generic private file mappings (which store executables and
libraries) are not yet supported.


Also, I notice that do_wp_page() only calls handle_userfault() with
VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
set.

AFAICT that's expected, unshare should only be set on reads, never writes.
So uffd-wp shouldn't trap any of those.


AFAIU, as it stands now userfaultfd would not help tracking COW faults
caused by stores to private file mappings. Am I missing something ?

I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
most mappings. That one is async, though, so more like soft-dirty. It
might be doable to try making it sync too without a lot of changes based on
how async tracking works.

I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
be a good fit. Here is what I have in mind to replace the ksmd scanning
thread for the VM use-case by a purely user-space driven scanning:

Within qemu or similar user-space process:

1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
UFFDIO_REGISTER_MODE_WP mode.

2) Write-protect user-space memory using the PAGEMAP_SCAN ioctl with the
PM_SCAN_WP_MATCHING flag, to detect memory which stays invariant for a long time.

3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
Keep track of memory which is frequently modified, so it can be left alone and
not write-protected nor merged anymore.

4) Whenever pages stay invariant for a given lapse of time, merge them with the new
madvise(2) KSM_MERGE behavior.

Let me know if that makes sense.
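
As a rough illustration of the bookkeeping behind steps 2-4 above, here is a minimal user-space sketch. It does not call into the kernel: the 'written' flag stands in for what a PAGEMAP_SCAN pass with PAGE_IS_WRITTEN would report for a page, and the names and thresholds (page_state, scan_update, IDLE_SCANS_BEFORE_MERGE, HOT_WRITES_LIMIT) are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs, not taken from any kernel interface. */
#define IDLE_SCANS_BEFORE_MERGE 3  /* clean scans before a page becomes a merge candidate */
#define HOT_WRITES_LIMIT        2  /* dirty scans after which a page is left alone */

enum page_action { PAGE_KEEP_TRACKING, PAGE_TRY_MERGE, PAGE_LEAVE_ALONE };

struct page_state {
	unsigned int idle_scans;   /* consecutive scans with no write seen */
	unsigned int dirty_scans;  /* total scans in which a write was seen */
};

/*
 * Update one page after a scan pass.  'written' stands in for the
 * PAGE_IS_WRITTEN result that a PAGEMAP_SCAN pass would report.
 */
static enum page_action scan_update(struct page_state *ps, bool written)
{
	assert(ps != NULL);

	/* Step 3: frequently modified pages are no longer tracked or merged. */
	if (ps->dirty_scans >= HOT_WRITES_LIMIT)
		return PAGE_LEAVE_ALONE;

	if (written) {
		ps->idle_scans = 0;
		ps->dirty_scans++;
		return ps->dirty_scans >= HOT_WRITES_LIMIT ?
			PAGE_LEAVE_ALONE : PAGE_KEEP_TRACKING;
	}

	/* Step 4: pages that stay invariant long enough become candidates. */
	ps->idle_scans++;
	return ps->idle_scans >= IDLE_SCANS_BEFORE_MERGE ?
		PAGE_TRY_MERGE : PAGE_KEEP_TRACKING;
}
```

In a real scanner, PAGE_TRY_MERGE would be the point where the proposed madvise(2) merge behavior is invoked, and PAGE_KEEP_TRACKING the point where the page is write-protected again for the next pass.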

Note that one of the strengths of KSM in the kernel right now is that we write-protect and try to deduplicate only when we are fairly sure that we can deduplicate (unstable tree), and that the interaction with THPs / large folios is fairly well thought-through.

Also note that, just because data hasn't been written in some time interval, doesn't mean that it should be deduplicated and result in CoW on next write access.

One probably would have to mimic what the KSM implementation in the kernel does, and build something like the unstable tree, to find candidates where we can actually deduplicate. Then, have a way to not deduplicate if the content changed.
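
To make that concrete, a minimal user-space sketch of the idea could look like the following. It is only an illustration under stated assumptions: the kernel's unstable tree is a red-black tree ordered by page content, whereas this sketch uses a plain linear search for brevity, FNV-1a as a stand-in checksum, and a fixed 4096-byte page size. The names page_checksum, find_merge_candidate and SKETCH_PAGE_SIZE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size for this sketch */

/* FNV-1a, 64-bit: a cheap stand-in for whatever checksum a real scanner uses. */
static uint64_t page_checksum(const unsigned char *page)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
		h ^= page[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Return the index of an earlier page with identical content, or -1.
 * 'old_sums' holds the checksums recorded during the previous scan; a page
 * whose checksum changed since then is treated as volatile and skipped.
 */
static long find_merge_candidate(const unsigned char *pages, size_t npages,
				 const uint64_t *old_sums, size_t idx)
{
	assert(idx < npages);

	const unsigned char *page = pages + idx * SKETCH_PAGE_SIZE;
	uint64_t sum = page_checksum(page);

	if (sum != old_sums[idx])
		return -1;  /* content changed since last scan: do not deduplicate */

	for (size_t j = 0; j < idx; j++) {
		const unsigned char *other = pages + j * SKETCH_PAGE_SIZE;

		/* Checksum is only a filter; memcmp() confirms a real match. */
		if (page_checksum(other) == sum &&
		    memcmp(other, page, SKETCH_PAGE_SIZE) == 0)
			return (long)j;
	}
	return -1;
}
```

The point of the checksum comparison against the previous scan is that write-protection and merging are only attempted for pages that both look stable and have a confirmed duplicate, rather than for everything that happened not to be written during some interval.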

--
Cheers,

David / dhildenb