Re: [PATCH] KVM: guest_memfd: Disable VMA merging with VM_DONTEXPAND

From: Ackerley Tng

Date: Wed Feb 04 2026 - 18:17:32 EST


"David Hildenbrand (arm)" <david@xxxxxxxxxx> writes:

> On 2/4/26 22:37, Sean Christopherson wrote:
>> On Wed, Feb 04, 2026, Ackerley Tng wrote:
>>> Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:
>>>
>>>> #syz test: git://git.kernel.org/pub/scm/virt/kvm/kvm.git next
>>>>
>>>> guest_memfd VMAs don't need to be merged,
>>
>> Why not? There are benefits to merging VMAs that have nothing to do with folios.
>> E.g. map 1GiB of guest_memfd with 512*512 4KiB VMAs, and then it becomes quite
>> desirable to merge all of those VMAs into one.
>>

I didn't realise that VM_DONTEXPAND's no-expansion policy also covers the
case where adjacent VMAs with the same flags, etc., are merged
automatically. Since VM_DONTEXPAND blocks that kind of merging too, I
agree VM_DONTEXPAND is not a great choice.
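
To spell out what I had missed, here is a rough userspace model of the
merge decision. This is only a sketch: struct toy_vma, toy_can_merge()
and the TOY_* flag values are made up for illustration and are not the
kernel's definitions. The assumption it encodes is that a VMA marked
with a "special" flag such as VM_DONTEXPAND is never merged with its
neighbour, even when everything else lines up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag values (hypothetical); only the names echo the kernel's. */
#define TOY_VM_READ       0x1UL
#define TOY_VM_SHARED     0x8UL
#define TOY_VM_DONTEXPAND 0x40000UL

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, heavily simplified stand-in for a VMA. */
struct toy_vma {
	unsigned long start; /* first byte of the range (inclusive) */
	unsigned long end;   /* end of the range (exclusive) */
	unsigned long flags; /* TOY_VM_* bits */
	int file_id;         /* stand-in for the backing file */
	unsigned long pgoff; /* offset into the file, in pages */
};

/* Return true if prev and next could be folded into a single VMA. */
static bool toy_can_merge(const struct toy_vma *prev,
			  const struct toy_vma *next)
{
	assert(prev->start < prev->end && next->start < next->end);

	/* Modelled assumption: "don't expand" also means "don't merge". */
	if ((prev->flags | next->flags) & TOY_VM_DONTEXPAND)
		return false;

	/* The ranges must be adjacent, with identical flags and file. */
	if (prev->end != next->start)
		return false;
	if (prev->flags != next->flags || prev->file_id != next->file_id)
		return false;

	/* File offsets must be contiguous across the boundary. */
	return prev->pgoff +
	       ((prev->end - prev->start) >> TOY_PAGE_SHIFT) == next->pgoff;
}
```

In this model two back-to-back 4KiB mappings of the same file merge as
long as neither has TOY_VM_DONTEXPAND set, and stop merging as soon as
the flag is added, which is the side effect Sean is pointing out.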

>> Creating _hugepages_ doesn't add value, but that's not the same thing as merging
>> VMAs.
>>
>>>> especially now, since guest_memfd only supports PAGE_SIZE folios.
>>>>
>>>> Set VM_DONTEXPAND on guest_memfd VMAs.
>>>
>>> Local tests and syzbot agree that this fixes the issue identified. :)
>>>
>>> I would like to look into madvise(MADV_COLLAPSE) and uprobes triggering
>>> mapping/folio collapsing before submitting a full patch series.
>>>
>>> David, Michael, Vishal, what do you think of the choice of setting
>>> VM_DONTEXPAND to disable khugepaged?
>>
>> I'm not one of the above, but for me it feels very much like treating a symptom

Was going to find some solution before getting to you to save you some
time :)

>> and not fixing the underlying cause.
>
> And you are spot-on :)
>
>>
>> It seems like what KVM should do is not block one path that triggers hugepage
>> processing, but instead flat out disallow creating hugepages. Unfortunately,

__filemap_get_folio_mpol(), which we use in kvm_gmem_get_folio(), looks
up mapping_min_folio_order() to determine what order to allocate. I
think we could lock that down to always use order 0. I tried that here
[1] but in this case khugepaged allocates new folios for guest_memfd
(and others) directly in collapse_file(), explicitly specifying
PMD_ORDER.

I took a look and wasn't able to find a central callback/ops to catch
all fs allocations.

[1] https://lore.kernel.org/all/6982553e.a00a0220.34fa92.0009.GAE@xxxxxxxxxx/
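
To illustrate why locking down the mapping's folio order is not enough,
here is a toy model of the two allocation paths described above. All
names (struct toy_mapping, toy_cache_alloc_order(),
toy_collapse_alloc_order(), TOY_PMD_ORDER) are hypothetical and this is
not kernel code; it only sketches the idea that one path consults the
mapping's order limits and the other picks its own order.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages and a 2 MiB PMD, as on x86-64. */
#define TOY_PMD_ORDER 9

/* Hypothetical stand-in for the folio order limits of a mapping. */
struct toy_mapping {
	unsigned int min_order;
	unsigned int max_order;
};

/*
 * Page-cache style allocation: the requested order is clamped into the
 * mapping's [min_order, max_order] range before allocating.
 */
static unsigned int toy_cache_alloc_order(const struct toy_mapping *m,
					  unsigned int requested)
{
	assert(m->min_order <= m->max_order);

	if (requested < m->min_order)
		return m->min_order;
	if (requested > m->max_order)
		return m->max_order;
	return requested;
}

/*
 * Collapse style allocation: the caller chooses the order itself and
 * the mapping's limits are never consulted.
 */
static unsigned int toy_collapse_alloc_order(const struct toy_mapping *m)
{
	(void)m; /* intentionally unused: that is the point of the model */
	return TOY_PMD_ORDER;
}
```

With both limits set to 0, the page-cache style path always returns
order 0, while the collapse style path still returns TOY_PMD_ORDER.
That matches what I saw: restricting the mapping alone does not stop
khugepaged from producing a PMD-sized folio.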

>> AFAICT, there's no existing way to prevent madvise() from clearing VM_NOHUGEPAGE,
>> so we can't simply force that flag.
>>
>> I'd prefer not to special case guest_memfd, a la devdax, but I also want to address
>> this head-on, not by removing a tangentially related trigger.
>
> VM_NOHUGEPAGE also smells like the wrong thing. This is a file limitation.
>
> !thp_vma_allowable_order() must take care of that somehow down in
> __thp_vma_allowable_orders(), by checking the file.
>
> Likely the file_thp_enabled() check is the culprit with
> CONFIG_READ_ONLY_THP_FOR_FS?
>
> Maybe we need a flag to say "even not CONFIG_READ_ONLY_THP_FOR_FS".
>
> I wonder how we handle that for secretmem. Too late for me, going to bed :)
>

Let me look deeper into this. Thanks!

> --
> Cheers,
>
> David