Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness tracking changes

From: David Hildenbrand
Date: Tue Mar 18 2025 - 15:13:19 EST


On 18.03.25 03:24, Yan Zhao wrote:
On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:
On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
On 14.03.25 10:09, Yan Zhao wrote:
On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
(split is possible if there are no unexpected folio references; private
pages cannot be GUP'ed, so it is feasible)
...
Note that I'm not quite sure about the "2MB" interface, should it be a
"PMD-size" interface?

I think Mike and I touched upon this aspect too - and I may be
misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
in increments -- and then fitting in PMD sizes when we've had enough of
those. That is to say he didn't want to preclude it, or gate the PMD
work on enabling all sizes first.

Starting with 2M is reasonable for now. The real question is how we want to
deal with
Hi David,


Hi!

I'm just trying to understand the background of in-place conversion.

Regarding the two issues you mentioned with THP and non-in-place-conversion,
I have some questions (still based on starting with 2M):

(a) Not being able to allocate a 2M folio reliably
If we start by faulting in private pages from guest_memfd (not in a page-pool way)
and shared pages anonymously, is it correct to say that this is only a concern
when memory is under pressure?

Usually, fragmentation starts being a problem under memory pressure, and
memory pressure can show up simply because the page cache makes use of as
much memory as it wants.

As soon as we start allocating a 2 MB page for guest_memfd, to then split it
up + free only some parts back to the buddy (on private->shared conversion),
we create fragmentation that cannot get resolved as long as the remaining
private pages are not freed. A new conversion from shared->private on the
previously freed parts will allocate other unmovable pages (not the freed
ones) and make fragmentation worse.
Ah, I see. The problem of fragmentation is because memory allocated by
guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB is
still unmovable.
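
To make the point above concrete, here is a toy userspace model, not kernel code. All toy_* names are hypothetical and exist only for illustration. It treats memory as a few 2 MiB blocks of 512 4 KiB pages each and shows that freeing a single 4 KiB page out of an unmovable 2 MiB allocation yields a free page, but no additional free 2 MiB block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512   /* 2 MiB / 4 KiB */
#define NR_BLOCKS    4     /* toy physical memory: 4 x 2 MiB */

enum toy_state { TOY_FREE, TOY_UNMOVABLE };

struct toy_mem {
    enum toy_state page[NR_BLOCKS][PAGES_PER_2M];
};

/* Start with all memory free. */
static void toy_init(struct toy_mem *m)
{
    for (size_t b = 0; b < NR_BLOCKS; b++)
        for (size_t p = 0; p < PAGES_PER_2M; p++)
            m->page[b][p] = TOY_FREE;
}

/* A block can back a 2 MiB page only if all 512 of its pages are free. */
static bool toy_block_is_free(const struct toy_mem *m, size_t b)
{
    assert(b < NR_BLOCKS);
    for (size_t p = 0; p < PAGES_PER_2M; p++)
        if (m->page[b][p] != TOY_FREE)
            return false;
    return true;
}

static size_t toy_free_2m_blocks(const struct toy_mem *m)
{
    size_t n = 0;
    for (size_t b = 0; b < NR_BLOCKS; b++)
        n += toy_block_is_free(m, b);
    return n;
}

static size_t toy_free_4k_pages(const struct toy_mem *m)
{
    size_t n = 0;
    for (size_t b = 0; b < NR_BLOCKS; b++)
        for (size_t p = 0; p < PAGES_PER_2M; p++)
            n += (m->page[b][p] == TOY_FREE);
    return n;
}

/* Allocate one whole block as unmovable; returns its index, or -1. */
static int toy_alloc_2m(struct toy_mem *m)
{
    for (size_t b = 0; b < NR_BLOCKS; b++) {
        if (!toy_block_is_free(m, b))
            continue;
        for (size_t p = 0; p < PAGES_PER_2M; p++)
            m->page[b][p] = TOY_UNMOVABLE;
        return (int)b;
    }
    return -1;
}

/* Give a single 4 KiB page back (models split + partial free). */
static void toy_free_4k(struct toy_mem *m, size_t b, size_t p)
{
    assert(b < NR_BLOCKS && p < PAGES_PER_2M);
    m->page[b][p] = TOY_FREE;
}
```

In this model, calling toy_alloc_2m() and then toy_free_4k() on one page of that block raises toy_free_4k_pages() by one while toy_free_2m_blocks() stays the same, because the other 511 unmovable pages still pin the block.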

I previously thought fragmentation would only impact the guest by providing no
new huge pages. So if a confidential VM does not support merging small PTEs into
a huge PMD entry in its private page table, even if the new huge memory range is
physically contiguous after a private->shared->private conversion, the guest
still cannot bring back huge pages.

In-place conversion improves that quite a lot, because guest_memfd itself
will not cause unmovable fragmentation. Of course, under memory pressure,
when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
then, we already had fragmentation (and did not really cause any new one).

We discussed in the upstream call, that if guest_memfd (primarily) only
allocates 2M pages and frees 2M pages, it will not cause fragmentation
itself, which is pretty nice.
Makes sense.


(b) Partial discarding
For shared pages, are page migration and folio split possible for shared THP?

I assume by "shared" you mean "not guest_memfd, but some other memory we use
Yes, not guest_memfd, in the case of non-in-place conversion.

as an overlay" -- so no in-place conversion.

Yes, that should be possible as long as nothing else prevents
migration/split (e.g., longterm pinning)


For private pages, as you pointed out earlier, if we can ensure there are no
unexpected folio references for private memory, splitting a private huge folio
should succeed.

Yes, and maybe (hopefully) we'll reach a point where private parts will not
have a refcount at all (initially, frozen refcount, discussed during the
last upstream call).
Yes, I also tested in TDX by not acquiring folio ref count in TDX specific code
and found that partial splitting could work.

Are you concerned about the memory fragmentation after repeated
partial conversions of private pages to and from shared?

Not only repeated, even just a single partial conversion. But of course,
repeated partial conversions will make it worse (e.g., never getting a
private huge page back when there was a partial conversion).
Thanks for the explanation!

Do you think there's any chance for guest_memfd to support non-in-place
conversion first?
e.g. we can have private pages allocated from guest_memfd and allow the
private pages to be THP.

Meanwhile, shared pages are not allocated from guest_memfd, and let it only
fault in 4K granularity. (specify it by a flag?)

When we want to convert a 4K from a 2M private folio to shared, we can just
split the 2M private folio as there's no extra ref count of private pages;

Yes, IIRC that's precisely what this series is doing, because the ftruncate() will try splitting the folio (which might still fail on speculative references, see my comment in reply to this series).

In essence: yes, splitting to 4k should work (although speculative references might require us to retry). But the "4k hole punch" is the ugly bit.

So you really want in-place conversion where the private->shared will split (but not punch) and the shared->private will collapse again if possible.
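
The difference between "split but not punch" and "4k hole punch" can be sketched with a toy userspace model, not kernel code. The struct and all toy_* names are hypothetical. It tracks per-4 KiB state inside one 2 MiB range: an in-place conversion only flips a page between private and shared, while a punch gives the page up for good (in this model there is deliberately no way to refill a punched page with the same memory).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512   /* 2 MiB / 4 KiB */

struct toy_folio {
    bool shared[PAGES_PER_2M];   /* page currently converted to shared */
    bool punched[PAGES_PER_2M];  /* page was freed by a hole punch */
    size_t nr_shared;
    size_t nr_punched;
};

/* A fresh 2 MiB range: everything private, nothing punched. */
static void toy_folio_init(struct toy_folio *f)
{
    *f = (struct toy_folio){0};
}

/* In-place conversion: flip the state, keep the page where it is. */
static void toy_convert(struct toy_folio *f, size_t idx, bool to_shared)
{
    assert(idx < PAGES_PER_2M);
    assert(!f->punched[idx]);
    if (f->shared[idx] == to_shared)
        return;
    f->shared[idx] = to_shared;
    if (to_shared)
        f->nr_shared++;
    else
        f->nr_shared--;
}

/* Hole punch: the 4 KiB page leaves the range entirely. */
static void toy_punch(struct toy_folio *f, size_t idx)
{
    assert(idx < PAGES_PER_2M);
    if (f->punched[idx])
        return;
    if (f->shared[idx]) {
        f->shared[idx] = false;
        f->nr_shared--;
    }
    f->punched[idx] = true;
    f->nr_punched++;
}

/* The range can be mapped as one private 2 MiB page again only if
 * every 4 KiB page is still present and private. */
static bool toy_can_map_huge(const struct toy_folio *f)
{
    return f->nr_shared == 0 && f->nr_punched == 0;
}
```

With this model, toy_convert(f, i, true) followed by toy_convert(f, i, false) makes toy_can_map_huge() true again, whereas a single toy_punch(f, i) leaves it false from then on.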


when we do shared to private conversion, no split is required as shared pages
are in 4K granularity. And even if user fails to specify the shared pages as
small pages only, the worst thing is that a 2M shared folio cannot be split, and
more memory is consumed.

Of course, memory fragmentation is still an issue as the private pages are
allocated unmovable.

Yes, and that you will never ever get a "THP" back when there was a conversion from private->shared of a single page that split the THP and discarded that page.

But do you think it's a good simpler start before in-place
conversion is ready?

There was a discussion on that in the bi-weekly upstream meeting on February 6. The recording has more details; I summarized it as

"David: Probably a good idea to focus on the long-term use case where we have in-place conversion support, and only allow truncation in hugepage (e.g., 2 MiB) size; conversion shared<->private could still be done on 4 KiB granularity as for hugetlb."

In general, I think our time is better spent working on the real deal than on interim solutions that should not be called "THP support".

--
Cheers,

David / dhildenb