Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()
From: Yin Tirui
Date: Thu Apr 02 2026 - 03:49:57 EST
On 4/2/26 14:46, Lorenzo Stoakes (Oracle) wrote:
>
> I mean you would have needed to handle this case in any event, since this change
> is strictly an equivalent reworking of zap_huge_pmd().
>
> But it seems that doing so has clarified the requirements somewhat here :)
>
> I haven't had a look at that series yet (please cc this email if you weren't
> already, I do filter a lot of stuff due to how much mail I get daily)
Hi Lorenzo,

Thanks for the quick reply. I will definitely CC you on the v4 series.
>
> So if this is a PMD leaf entry it will be present and PFN map, so I'd have
> thought simply adding:
>
> 	/* Huge PFN map must deposit, as cannot refault. */
> 	if (vma_test(vma, VMA_PFNMAP_BIT))
> 		return true;
>
> Would suffice?
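Here is the dilemma:

Currently, VFIO uses vmf_insert_pfn_pmd() to create huge pfnmaps on page
faults. VFIO sets VM_PFNMAP in vfio_pci_core_mmap(), but
vmf_insert_pfn_pmd() does not deposit a pgtable (unless
arch_needs_pgtable_deposit() is true). So a plain VM_PFNMAP check would
report a deposited pgtable for VFIO mappings that never deposited one.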
I can see three ways to resolve this:

Option A: Force VFIO (vmf_insert_pfn_pmd()) to also deposit pgtables. This
unifies the VM_PFNMAP lifecycle. However, since VFIO can refault,
depositing pgtables here incurs unnecessary memory overhead.

Option B: Introduce a new VMA flag set during remap_pfn_range(), which
we can explicitly check in has_deposited_pgtable().

Option C: Check vma->vm_ops->fault (and huge_fault). We would only
deposit pgtables for mappings without fault handlers. However, this is
fragile because a driver might still register a .fault() handler that
simply returns VM_FAULT_SIGBUS.

Do you have a preference among these, or perhaps another idea?
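
To make Option B a bit more concrete, here is a rough userspace model of
the PFN-map branch only. It is a sketch, not kernel code: the struct, the
MODEL_* flag values, the MODEL_VM_REMAP_NOFAULT name and
model_has_deposited_pgtable() are all hypothetical and exist only for
illustration. The real has_deposited_pgtable() has other cases that are
not modelled here.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; NOT the kernel's real values. */
#define MODEL_VM_PFNMAP        (1u << 0)
/* Hypothetical Option B flag, assumed to be set by remap_pfn_range(). */
#define MODEL_VM_REMAP_NOFAULT (1u << 1)

/* Minimal stand-in for the part of a VMA this sketch cares about. */
struct model_vma {
	unsigned int flags;
};

/*
 * Model of the PFN-map decision under Option B: a pgtable was deposited
 * if the architecture always requires one, or if the mapping was marked
 * as one that cannot refault. VM_PFNMAP alone is deliberately not enough,
 * because fault-driven huge pfnmaps (the VFIO case) also set it.
 */
static bool model_has_deposited_pgtable(const struct model_vma *vma,
					bool arch_needs_deposit)
{
	if (arch_needs_deposit)
		return true;

	return (vma->flags & MODEL_VM_REMAP_NOFAULT) != 0;
}
```

With this model, a VMA carrying only MODEL_VM_PFNMAP reports no deposit
unless the arch forces one, while a VMA that also carries the hypothetical
remap flag always reports a deposit.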
>
> By the way, I am wondering if the prot bits are correctly preserved on page
> table deposit, as this is key for pfn map (e.g. if the range is uncached, for
> instance). That's something to check and ensure is correct.
>
> I _suspect_ they will be, as we have pretty well established mechanisms for that
> (propagate vma->vm_page_prot etc.) but definitely worth making sure.
>
Yes, they are correctly preserved!

During a PMD split in __split_huge_pmd_locked(), we populate the
deposited pgtable like this:

	entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
	set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);

The newly refactored pmd_pgprot() correctly extracts the exact
protection bits (including crucial cache modes like UC/WC for device
memory) from the huge PMD, strips the hardware-specific huge bit, and
returns a pure PTE-level pgprot_t.
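
As a toy illustration of that split, here is a small userspace model. The
bit layout is invented for the example (12 flag bits, a "huge" bit at
bit 7, an "uncached" bit at bit 4) and does not match any real
architecture, where pmd_pgprot() may have to do more than mask out one
bit. All model_* names and MODEL_* constants are hypothetical.

```c
#include <stdint.h>

/* Invented layout for this model only; not a real architecture's. */
#define MODEL_PAGE_SHIFT   12
#define MODEL_PTRS_PER_PMD 512
#define MODEL_FLAGS_MASK   ((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1)
#define MODEL_BIT_HUGE     (UINT64_C(1) << 7)
#define MODEL_BIT_UNCACHED (UINT64_C(1) << 4)

/* Extract the base PFN from a model huge PMD value. */
static uint64_t model_pmd_pfn(uint64_t pmd)
{
	return pmd >> MODEL_PAGE_SHIFT;
}

/* Keep every flag bit except the huge bit, giving PTE-level prot bits. */
static uint64_t model_pmd_pgprot(uint64_t pmd)
{
	return pmd & MODEL_FLAGS_MASK & ~MODEL_BIT_HUGE;
}

/* Build a model PTE value from a PFN and prot bits. */
static uint64_t model_pfn_pte(uint64_t pfn, uint64_t prot)
{
	return (pfn << MODEL_PAGE_SHIFT) | prot;
}

/*
 * Fill one PTE table from a huge PMD: each entry maps the next PFN and
 * carries the same prot bits, so cache attributes survive the split.
 */
static void model_split_pmd(uint64_t pmd, uint64_t ptes[MODEL_PTRS_PER_PMD])
{
	uint64_t pfn = model_pmd_pfn(pmd);
	uint64_t prot = model_pmd_pgprot(pmd);

	for (int i = 0; i < MODEL_PTRS_PER_PMD; i++)
		ptes[i] = model_pfn_pte(pfn + i, prot);
}
```

In this model, splitting a PMD that has the uncached bit set yields 512
entries that all keep the uncached bit and none of which has the huge bit.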
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20260228070906.1418911-5-yintirui@xxxxxxxxxx/
>>
>> --
>> Yin Tirui
>>
>
> Cheers, Lorenzo
--
Yin Tirui