Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory

From: David Hildenbrand (Arm)

Date: Wed Apr 22 2026 - 14:42:14 EST


On 4/21/26 16:33, Kiryl Shutsemau wrote:
> On Tue, Apr 21, 2026 at 03:03:56PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/19/26 16:33, Kiryl Shutsemau wrote:
>>>
>>> See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3
>>>
>>
>> Quick feedback from skimming over it:
>>
>>
>> 1) ARCH_SUPPORTS_PROT_NONE needs some thought, because I am pretty sure all
>> architectures support something like mprotect(PROT_NONE), and the config
>> option might be misleading.
>>
>> So you very likely want to express different semantics here. You want to
>> know whether pte_protnone()/pmd_protnone() works.
>
> We do support mprotect(PROT_NONE) everywhere, but we don't always have a
> way to distinguish such entries from others without the VMA in hand. Like,
> there are other PTEs that don't have the present bit set. In my case and in
> the NUMA balancing context we cannot rely on the VMA, because we want to
> install PAGE_NONE entries into an accessible VMA.

Exactly. So it's not ARCH_SUPPORTS_PROT_NONE.

>
> So we need two things: the pte/pmd_protnone() checks and PAGE_NONE itself.
> The first is to test a PTE for PAGE_NONE, the second is for pte/pmd_modify()
> to make the entry protnone.
>
> Currently, generic code only uses this functionality for NUMA balancing,
> and it is gated by the NUMA balancing config option. So I moved it under a
> separate config option.
>
> Do you want it to be named differently?

Would ARCH_SUPPORTS_PXX_PROTNONE or something like that better describe that
pte_protnone()/pmd_protnone() do what we want?

>
>> 2) The other stuff is really just an extension of existing WP handling.
>> I suspect we want to have some reasonable cleanups to not end up in
>> common code with
>>
>> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> mm_inc_nr_ptes(dst_mm);
>> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> - if (!userfaultfd_wp(dst_vma))
>> + if (!userfaultfd_wp(dst_vma) && !userfaultfd_rwp(dst_vma))
>> pmd = pmd_swp_clear_uffd_wp(pmd);
>> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>>
>> All the uffd handling should be better isolated (i.e., a single vma check?),
>> and likely the uffd bit should be abstracted away from being called "wp" to
>> something more generic.
>>
>> Maybe it's simply a "uffd" flag whose semantics depend
>> on the vma flags.
>>
>> Maybe something like:
>>
>> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>> add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> mm_inc_nr_ptes(dst_mm);
>> pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> if (!userfaultfd_uses_pte_bit(dst_vma))
>> pmd = pmd_swp_clear_uffd(pmd);
>> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>>
>> Not sure, needs another thought. But I think there are some decent
>> cleanups to be had.
>
> That's fair. Maybe userfaultfd_protected() is a better name for the VMA
> check?

Yes, something like that could also work.

>
> And about the UFFD_WP bit name: maybe we can just drop _WP: _PAGE_UFFD_WP ->
> _PAGE_UFFD, pte_uffd_wp() -> pte_uffd()?

Yes, I hinted at the above with pmd_swp_clear_uffd().

>
> But it is a lot of changes. Can I do the bit rename as a follow-up
> patchset?

Let's get this clean. There is no need to rush that in ;)

I suspect it's a fairly mechanical change.
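
As a rough illustration of the shape such a cleanup could take, here is a
userspace toy model of "one generic uffd bit, one VMA check". Everything
prefixed toy_/TOY_ is made up for this sketch and is not taken from the kernel
or from the series; the bit position and flag values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; none of these are kernel definitions. */
#define TOY_VM_UFFD_WP   (1u << 0)	/* VMA registered for write-protect tracking */
#define TOY_VM_UFFD_RWP  (1u << 1)	/* VMA registered for read/write-protect tracking */
#define TOY_PTE_UFFD     (1ull << 10)	/* one generic marker bit, arbitrary position */

struct toy_vma {
	unsigned int flags;
};

typedef struct {
	uint64_t val;
} toy_pte_t;

/* Single VMA check: does any uffd protection mode make use of the PTE bit? */
static bool toy_userfaultfd_protected(const struct toy_vma *vma)
{
	return (vma->flags & (TOY_VM_UFFD_WP | TOY_VM_UFFD_RWP)) != 0;
}

static bool toy_pte_uffd(toy_pte_t pte)
{
	return (pte.val & TOY_PTE_UFFD) != 0;
}

static toy_pte_t toy_pte_clear_uffd(toy_pte_t pte)
{
	pte.val &= ~TOY_PTE_UFFD;
	return pte;
}

/* fork-style copy: keep the marker only if the destination VMA uses it. */
static toy_pte_t toy_copy_pte(const struct toy_vma *dst_vma, toy_pte_t pte)
{
	assert(dst_vma != NULL);
	if (!toy_userfaultfd_protected(dst_vma))
		pte = toy_pte_clear_uffd(pte);
	return pte;
}
```

The point of the sketch is only that the copy path no longer needs to know
which protection mode is in use: what the bit means is decided by the VMA flags.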

>
>> 3) Some other stuff needs a second thought, like
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 8e7dc2c6ee738..08fc18f1290d4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
>> /* ... and a write-fault isn't required for other reasons. */
>> if (pmd_needs_soft_dirty_wp(vma, pmd))
>> return false;
>> - return !userfaultfd_huge_pmd_wp(vma, pmd);
>> + return !userfaultfd_huge_pmd_wp(vma, pmd) &&
>> + !userfaultfd_huge_pmd_rwp(vma, pmd);
>> }
>>
>> How can a pte be writable and prot_none at the same time? Maybe just confused AI
>> output that you should carefully double check before sending that out officially.
>
> Note that this path is for the !pmd_write() case to begin with. It serves
> the FOLL_FORCE case. I believe this check is correct: we don't want to
> allow writes to such pages even with FOLL_FORCE.
>
> But looking around, I missed the gup_can_follow_protnone() modification. It
> has to return false for RWP.

Right, read-permission checks come before the write-permission checks.

>
>> 4) How do we want to handle PM_UFFD_WP?
>>
>> We will be pretty much out of flags soon. Overloading PM_UFFD_WP means that
>> we will not be able to easily support using a separate bit.
>>
>> But our internal design will not easily allow that either, and I am not really
>> sure we want to go down that path any time soon.
>>
>> Maybe we could document this for now as "In WP VMAs, indicates WP PTEs.
>> Otherwise, in RWP VMAs, indicates RWP.". Whenever we would allow both at the
>> same time, we could change the semantics. User space would fail to create one
>> with both protection types for now either way.
>
> Yeah. I am thinking about doing a documentation-only update for PM_UFFD_WP
> for now.

Ok, good!
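
For illustration, user space could then interpret the overloaded bit roughly
like this. This is a sketch under the semantics proposed above, not an existing
API: the toy_* names and enums are hypothetical, and the caller is assumed to
know which mode it registered the range with. The only real-world fact used is
that PM_UFFD_WP is bit 57 of a pagemap entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 57 of a /proc/<pid>/pagemap entry is PM_UFFD_WP. */
#define TOY_PM_UFFD_WP_BIT 57

/* Hypothetical: the mode user space registered the range with. */
enum toy_uffd_mode { TOY_UFFD_NONE, TOY_UFFD_WP, TOY_UFFD_RWP };

/* Hypothetical: what the bit means once the mode is taken into account. */
enum toy_uffd_state { TOY_STATE_UNPROTECTED, TOY_STATE_WP, TOY_STATE_RWP };

static bool toy_pm_uffd_bit(uint64_t pagemap_entry)
{
	return ((pagemap_entry >> TOY_PM_UFFD_WP_BIT) & 1) != 0;
}

/*
 * Decode the overloaded bit: in a WP range it means write-protected, in an
 * RWP range it means read/write-protected. A range cannot be both for now.
 * Assumption of this sketch: a set bit in an unregistered range is ignored.
 */
static enum toy_uffd_state toy_decode_pm_uffd(uint64_t pagemap_entry,
					      enum toy_uffd_mode mode)
{
	if (!toy_pm_uffd_bit(pagemap_entry))
		return TOY_STATE_UNPROTECTED;

	switch (mode) {
	case TOY_UFFD_WP:
		return TOY_STATE_WP;
	case TOY_UFFD_RWP:
		return TOY_STATE_RWP;
	default:
		return TOY_STATE_UNPROTECTED;
	}
}
```

If both modes were ever allowed on the same range, this decoding would no
longer be sufficient and the semantics would have to change.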

--
Cheers,

David