Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Peter Xu
Date: Thu Apr 23 2026 - 14:57:50 EST
On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote:
> > Hello, Kiryl,
> >
> > On Thu, Apr 23, 2026 at 03:27:11PM +0100, Kiryl Shutsemau wrote:
> > > The patchset is in pretty good shape in my eyes and I will probably drop
> > > the RFC tag.
> >
> > I still have some high-level questions that have not been answered yet.
> > Do you want to answer them?
> >
> > https://lore.kernel.org/all/ad59TxAHNwFWH7Cc@x1.local/
>
> Sorry, reply to this got lost in my TODO list.
No worries.
>
> > In summary, it's about:
> >
> > - Whether we have explored other approaches on page hotness tracking
>
> So, for read/write tracking we have clear_refs=1, page_idle and DAMON.
> Did I miss something?
>
> clear_refs is a process-wide hammer. And you can miss a hot page if it
> races with LRU rotation.
>
> page_idle needs rmap. It will not scale.
Yes. If you would benefit from a per-mm page_idle, then it may apply to us
too, if we are forced to implement full-userspace swap in QEMU.

That's also why I suggested (in my previous reply) that we split the
requirement: one is for hotness tracking, the other is about read-inclusive
trapping (vs. wr-protect-only traps).
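
For context on the existing interfaces listed in the quoted text, here is a minimal sketch of the process-wide clear_refs approach: write "1" to /proc/PID/clear_refs to reset the referenced/accessed state, then read the "Referenced:" fields from /proc/PID/smaps. The helper names are made up for illustration, and the sketch assumes a Linux /proc with these files available.

```python
import re


def sum_referenced_kb(smaps_text):
    """Sum every 'Referenced:' field (in kB) found in smaps-formatted text."""
    fields = re.findall(r"^Referenced:\s+(\d+) kB", smaps_text, re.M)
    return sum(int(kb) for kb in fields)


def read_referenced_kb(pid="self"):
    """Total referenced memory of a process, as reported by its smaps file."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_referenced_kb(f.read())


def clear_refs(pid="self"):
    """Reset referenced/accessed state for ALL pages of the process.

    This is the process-wide granularity mentioned in the thread: there is
    no way to restrict the reset to a single range with the value "1".
    """
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("1")


if __name__ == "__main__":
    try:
        before = read_referenced_kb()
        clear_refs()
        after = read_referenced_kb()
        print(f"Referenced before: {before} kB, after clear_refs: {after} kB")
    except OSError as exc:
        # Hypothetical fallback: non-Linux host or restricted /proc.
        print(f"/proc interface not available here: {exc}")
```

Because the reset applies to the whole address space, every tracking pass disturbs the accessed state that the kernel's own reclaim relies on, which is the "hammer" concern raised above.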
>
> DAMON is built around sampling. It is good for working set estimation,
> but I don't think it is directly useful for eviction decision. It can
> miss hot pages. LRU rotation will also lose info.
Exactly. If we need to collect the ACCESS bit (or anything similar) for
eviction accuracy purposes, IIUC we need per-page info; we can't estimate
by sampling.
>
> None of them gives comparable capabilities.
I want to see if some of your work can be generalized so we can use it too,
and we can also work together.
>
> We also need a mechanism to atomically evict pages.
Yes, this is the 2nd question below, and btw uffd-wp can also achieve this.
>
> > - Whether read protection is required for a userspace swap system
> > (e.g. did you get time to have a look at umap?)
>
> I looked at it briefly, so I may have missed details.
>
> IIUC, in the absence of read tracking it doesn't collect hotness information
> at all. The eviction is based on fault-in time: the oldest faulted-in
For example, imagine we had a per-mm idle page tracker: would it work for
you for collecting hotness info?

The other idea is, no matter whether we use MGLRU or legacy LRU, would it be
possible to expose a better interface to share hotness info from the kernel
to userspace?
> page gets evicted first. I guess it is fine if you don't care much about
> refault cost. Like, if your workload fits into memory completely and
> refaults are rare.
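
To illustrate the refault-cost point in the quoted text, here is a toy simulation. It compares eviction ordered by fault-in time (the policy as described above, under the "IIUC" caveat) with eviction ordered by last access, which is what per-page hotness info would enable. The trace, the capacity, and the function names are invented for the example; it models no real swapper.

```python
from collections import OrderedDict


def count_faults_fault_in_order(accesses, capacity):
    """Evict the page that was faulted in first; hits do not change order."""
    resident = OrderedDict()  # insertion order == fault-in order
    faults = 0
    for page in accesses:
        if page in resident:
            continue  # access is invisible to the policy (no read tracking)
        faults += 1
        if len(resident) >= capacity:
            resident.popitem(last=False)  # oldest faulted-in page goes first
        resident[page] = True
    return faults


def count_faults_access_order(accesses, capacity):
    """Evict the least recently accessed page; hits refresh the order."""
    resident = OrderedDict()
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # hotness info keeps the page young
            continue
        faults += 1
        if len(resident) >= capacity:
            resident.popitem(last=False)
        resident[page] = True
    return faults


# Page 0 is hot: it is touched between every access to a cold page.
trace = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5]
fifo_faults = count_faults_fault_in_order(trace, capacity=2)
lru_faults = count_faults_access_order(trace, capacity=2)
print(f"fault-in order: {fifo_faults} faults, access order: {lru_faults} faults")
```

On this trace the fault-in-order policy evicts the hot page twice even though it is in constant use, while the access-ordered policy only takes the unavoidable first-touch faults. When the working set fits in memory, both policies behave the same.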
One thing to mention is, if we have any of the hotness tracking facilities
above ready (e.g. per-mm idle page tracking), we _will_ trap read faults too;
it's just that it'll be much faster (when it's the hardware ACCESS bit).

So if I'm not wrong, the full userspace swap system I am trying to discuss
will also trap reads in most cases.
The difference is only about that 5ms (in the 30s+5ms example I gave in
the other email). Your RW protection will also trap accesses in that 5ms;
what I described won't: when a decision is made, we wr-protect the page, and
any read on top of it will still go through, so it will trigger a refault.
My point is that missing 5ms out of a 30s (in reality maybe more than 30s)
sampling window (which covers read accesses) isn't a major issue, and IMHO
it's not a strong enough reason to include the whole RW feature.
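
As a back-of-the-envelope check of that ratio, the sketch below assumes the 30s is the window in which reads are tracked and the 5ms is the gap between wr-protecting a page and completing its eviction, during which reads go untracked. The constant and function names are made up for the example.

```python
TRACKED_WINDOW_S = 30.0   # assumed: sampling window that covers read accesses
UNTRACKED_GAP_S = 0.005   # assumed: wr-protect-to-eviction gap (5ms)


def untracked_fraction(tracked_s, gap_s):
    """Share of the page's total observation time in which reads are missed."""
    return gap_s / (tracked_s + gap_s)


fraction = untracked_fraction(TRACKED_WINDOW_S, UNTRACKED_GAP_S)
print(f"reads go untracked for {fraction:.4%} of the observation time")
```

A longer sampling window only shrinks this fraction further, since the gap stays fixed while the denominator grows.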
The other thing is, as I mentioned in the other email, I still don't know
how the current RW protection would work for anonymous memory. So far I
don't think the user swapper can read an anon page with RW-protected
pgtables. My understanding is that maybe you only care about shmem so it's
fine, but it would be great to confirm with you.
Thanks,
>
> That's not my case.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
>
--
Peter Xu