Re: [PATCH v3 13/13] mm/huge_memory: add and use has_deposited_pgtable()

From: Yin Tirui

Date: Tue Apr 14 2026 - 03:39:59 EST


Hi Lorenzo and David,

Sorry for the late reply.

On 4/7/26 18:48, Lorenzo Stoakes wrote:
> On Thu, Apr 02, 2026 at 03:49:35PM +0800, Yin Tirui wrote:
>>
>>
>> On 4/2/26 14:46, Lorenzo Stoakes (Oracle) wrote:
>>>
>>> I mean you would have needed to handle this case in any event, since this change
>>> is strictly an equivalent reworking of zap_huge_pmd().
>>>
>>> But it seems that doing so has clarified the requirements somewhat here :)
>>>
>>> I haven't had a look at that series yet (please cc this email if you weren't
>>> already, I do filter a lot of stuff due to how much mail I get daily)
>>
>> Hi Lorenzo,
>>
>> Thanks for the quick reply. I will definitely CC you on the v4 series.
>
> Thanks.
>
>>
>>>
>>> So if this is a PMD leaf entry it will be present and PFN map, so I'd have
>>> thought simply adding:
>>>
>>> /* Huge PFN map must deposit, as cannot refault. */
>>> if (vma_test(vma, VMA_PFNMAP_BIT))
>>> 	return true;
>>>
>>> Would suffice?
>>
>> Here is the dilemma:
>>
>> Currently, VFIO uses vmf_insert_pfn_pmd() to create huge pfnmaps on page
>> faults. This sets VM_PFNMAP in vfio_pci_core_mmap(), but it does not
>> deposit a pgtable (unless arch_needs_pgtable_deposit() is true).
>
> Hmmm... it's only the VFIO and hyperv drivers using this.
>
> Wouldn't we generally want a deposited huge page here now we're allowing huge
> PFN maps?
>
> Or are these _special cases_ where we have a PMD-sized entry but are not
> necessarily wanting to treat it as THP?
>
> This is a real wrinkle in this whole series no?
>
> David - any thoughts?
>
>>
>> To resolve this,
>>
>> Option A: Force VFIO (vmf_insert_pfn_pmd) to also deposit pgtables. This
>> unifies the VM_PFNMAP lifecycle. However, since VFIO can refault,
>> depositing pgtables here incurs unnecessary memory overhead.
>
> How can VFIO refault as a PFN mapping? Does it intentionally sometimes
> clear PTE entries to effect a refault, and implement a custom fault
> handler?
>
> I guess having a fault handler makes it refaultable...
>
> I mean obviously that then contradicts the suggested comment above :)
>
> That seems to me to cast a bit of a question over the whole series - having
> PMD mappings that are _sometimes_ THP and _sometimes_ not is weird (TM).
>
> And it'd suck to add - yet another very specific check - to determine if we
> do, in fact, assume THP for a PMD sized PFN map.

Yes, exactly. VFIO and Hyper-V rely on their custom fault handlers
(`.fault`/`.huge_fault`) to build mappings dynamically, so a zapped range
can simply be refaulted later. In contrast, `remap_pfn_range()` establishes
a static, pre-populated mapping at mmap time with no fault handler, so it
can never be refaulted.
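
To make the distinction concrete, here is a rough sketch of the two setup
styles (simplified pseudo-driver code, not taken from VFIO or Hyper-V; the
example_* helpers are placeholders, and example_lookup_pfn() stands in for
looking up the backing pfn in whatever form vmf_insert_pfn_pmd() expects):

/*
 * Style 1: fault-driven huge pfnmap.  The PMD entry is built lazily in
 * the driver's huge_fault handler, so the range can always refault.
 */
static vm_fault_t example_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	if (order != PMD_ORDER)
		return VM_FAULT_FALLBACK;

	return vmf_insert_pfn_pmd(vmf, example_lookup_pfn(vmf),
				  vmf->flags & FAULT_FLAG_WRITE);
}

/*
 * Style 2: static pfnmap.  The range is fully populated at mmap time via
 * remap_pfn_range() and no fault handler is installed, so it can never
 * be refaulted after a zap.
 */
static int example_static_mmap(struct file *file, struct vm_area_struct *vma)
{
	return remap_pfn_range(vma, vma->vm_start, example_base_pfn(file),
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}

Only style 1 installs vm_ops with a fault handler, which is exactly the
property Option C keys off.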

>
>>
>> Option B: Introduce a new VMA flag set during remap_pfn_range(), which
>> we can explicitly check in has_deposited_pgtable().
>
> Yeah would rather not, that feels like a hack.

Agreed.

>
>>
>> Option C: Check vma->vm_ops->fault (and huge_fault). We would only
>> deposit pgtables for mappings without fault handlers. However, this is
>> fragile because a driver might still register a .fault() handler that
>> simply returns VM_FAULT_SIGBUS.
>
> I mean again this is yet another check (TM). But probably the most preferable I
> think.
>
> Wouldn't a driver doing that be being somewhat redundant? E.g. in do_fault();
>
> if (!vma->vm_ops->fault) {
> 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> 				       vmf->address, &vmf->ptl);
> 	if (unlikely(!vmf->pte))
> 		ret = VM_FAULT_SIGBUS;
>
> And so can expect maybe some more redundancy if they also happen to map
> PMD-sized ranges? :)
>
> And the only two callers of vmf_insert_pfn_pmd() - hyperv and VFIO both
> implement actual fault handlers anyway.
>
> So I think this is fine?
>

I agree.

David, since Lorenzo also asked for your thoughts on the overall design
aspect ("sometimes THP and sometimes not"), what is your opinion on
this? Should we proceed with checking `!vma->vm_ops->fault` to
differentiate the deposit behavior for huge PFNMAPs?
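
To make the question concrete, the check I have in mind would look roughly
like this (an illustrative sketch only, not the actual has_deposited_pgtable()
from this series; the function name here is a placeholder and the exact
condition is what I'd like your input on):

/*
 * Sketch: a huge PFN map with no way to refault (no .fault/.huge_fault,
 * i.e. it was pre-populated via remap_pfn_range()) must have had a page
 * table deposited at map time.  Fault-driven mappings (VFIO, Hyper-V)
 * can rebuild their entries on demand, so no deposit is needed there
 * unless the architecture requires one anyway.
 */
static bool example_huge_pfnmap_has_deposit(struct vm_area_struct *vma)
{
	if (arch_needs_pgtable_deposit())
		return true;

	return !vma->vm_ops ||
	       (!vma->vm_ops->fault && !vma->vm_ops->huge_fault);
}

The fragility mentioned under Option C (a driver registering a .fault handler
that only returns VM_FAULT_SIGBUS) would still slip through such a check, but
as Lorenzo notes, such a handler is largely redundant given what do_fault()
already does when the handler is missing.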

>>
>> Do you have a preference among these, or perhaps another idea?
>>
>>>
>>> By the way, I am wondering if the prot bits are correctly preserved on page
>>> table deposit, as this is key for pfn map (e.g. if the range is uncached, for
>>> instance). That's something to check and ensure is correct.
>>>
>>> I _suspect_ they will be, as we have pretty well established mechanisms for that
>>> (propagate vma->vm_page_prot etc.) but definitely worth making sure.
>>>
>>
>> Yes, they are correctly preserved!
>>
>> During a PMD split in __split_huge_pmd_locked(), we populate the
>> deposited pgtable like this:
>>
>> entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
>> set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
>>
>> The newly refactored pmd_pgprot() correctly extracts the exact
>> protection bits (including crucial cache modes like UC/WC for device
>> memory) from the huge PMD, strips the hardware-specific huge bit, and
>> returns a pure PTE-level pgprot_t.
>
> OK good :)
>
>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/20260228070906.1418911-5-yintirui@xxxxxxxxxx/
>>
>> --
>> Yin Tirui
>>
>
> Cheers, Lorenzo

--
Yin Tirui