Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps

From: Boone, Max

Date: Wed Mar 11 2026 - 06:00:11 EST



> On Mar 10, 2026, at 4:19 PM, David Hildenbrand (Arm) <david@xxxxxxxxxx> wrote:
>
> !-------------------------------------------------------------------|
> This Message Is From an External Sender
> This message came from outside your organization.
> |-------------------------------------------------------------------!
>>
>> I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
>> pagewalk.c which prompted me to check whether I could reproduce the
>> bug in a more recent kernel.
>>
>> However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
>> tracing to get a clearer picture of the sequence of events) it doesn’t happen.
>> The VFIO DMA set operation is much faster on 7.0, so possibly the race
>> window is too small for it to occur in reasonable time.
>
> Interesting. You could try adding a delay to a test kernel to see if you
> can still provoke it.
>
> There is the slight possibility that something else fixed the race for
> your reproducer by "accident".
>
> [...]

Could be that the race window was made a lot smaller by [1], I see that the occurrence
of the race also drops significantly already when I’m just adding some extra trace
logs in the VFIO DMA set functions.

[…]

>
> Just to double-check: is that expected?
>
> I wonder why "-EINVAL" would be returned here. Do you know?
>

I don’t think it’s expected, but at least it’s an error that can be caught in userspace.
I’ll do a bit more digging into why and where that originates, but I think that’ll get a
patch in vfio.

The -EINVAL originates from:

vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
-> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)

Possibly that’s also the origin of the concurrent PUD modification that requires
the retry in the walker in this patch.

>>
>> For my own understanding, why is this patch preferred over:
>> - if (pud_none(*pud))
>> + if (pud_none(*pud) || pud_leaf(*pud))
>> in the walk_pud_range function?
>
> It might currently work for PUDs, but as soon as we have non-present PUD
> entries (like migration entries) the code could become shaky: pud_leaf()
> is only guaranteed to yield the right result if pud_present() is true.
>
> So I decided to instead make walk_pud_range() look more similar to
> walk_pmd_range(), which is quite helpful for spotting actual differences
> in the logic.
>
>>
>> I do think moving the check to walk_pmd_range is a more clear on the code’s intent and
>> personally prefer the code there, but I don’t see why this check is removing the possibility
>> of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check, as to me it looks
>> like the PMD entry was possible to disappear between the splitting and this check?
>
> I distilled that in the comment: PMD page tables cannot/are not
> reclaimed. So once you see a PMD page table, it's not going anywhere
> while you hold relevant locks (mmap_lock or VMA lock).
>
> Only PMD leaf entries can get zapped any time and PMD none entries can
> get populated any time. But not PMD page tables.

Gotcha, thanks!

>
>>
>> Anyways, regardless, this patch resolves the bug and looks good to me - what’s the
>> course of action as we probably want to backport this to earlier kernels as well. Shall
>> I send in a new PATCH without cover letter and take it from there?
>
> Right, I think you should:
>
> (1) rework the patch description to incorporate the essential stuff from
> the cover letter
> (2) Identify and add Fixes: tag and Cc: stable
> (3) Document that we are reworking the code to mimic what we do in
> walk_pmd_range(), to have less inconsistency on the core logic
> (4) Document why you think the reproducer fails on newer kernels. (or
> best try to get it reproduced by adding some delays in the code)
> (5) Clarify that only PUD handling are prone to the race and that PMDs
> are fine (and point out why)
> (6) Use a patch subject like "mm/pagewalk: fix race between unmapping
> and refaulting in walk_pud_range()"
>
> Once you resend, best to add
>
> Co-developed-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
>
> Above your SOB.
>
> To get something like:
>
> Co-developed-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
> Signed-off-by: Max Boone <mboone@xxxxxxxxxx>
>
> Note that the existing
>
> Signed-off-by: Max Tottenham <mtottenh@xxxxxxxxxx>
>
> Is weird, as Max Tottenham did not send out this patch. If he was
> involved in the development, you should either make him
>
> Suggested-by:
>
> Or
> Debugged-by:
>
> Or
> Co-developed-by: + Signed-off-by:
>
> See Documentation/process/submitting-patches.rst
>
>
> Let me know if you have any questions :)
>

Will do, thanks a lot!

> --
> Cheers,
>
> David

[1] https://lore.kernel.org/all/20250814064714.56485-1-lizhe.67@xxxxxxxxxxxxx/

Attachment: smime.p7s
Description: S/MIME cryptographic signature