On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:
On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
e.g. we can have private pages allocated from guest_memfd and allows the
On 14.03.25 10:09, Yan Zhao wrote:
Ah, I see. The problem of fragmentation is because memory allocated by
On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
(split is possible if there are no unexpected folio references; private...
pages cannot be GUP'ed, so it is feasible)
Hi David,
Note that I'm not quite sure about the "2MB" interface, should it be a
"PMD-size" interface?
I think Mike and I touched upon this aspect too - and I may be
misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
in increments -- and then fitting in PMD sizes when we've had enough of
those. That is to say he didn't want to preclude it, or gate the PMD
work on enabling all sizes first.
Starting with 2M is reasonable for now. The real question is how we want to
deal with
Hi!
I'm just trying to understand the background of in-place conversion.
Regarding the two issues you mentioned with THP and non-in-place-conversion,
I have some questions (still based on starting with 2M):
(a) Not being able to allocate a 2M folio reliably
If we start with faulting in private pages from guest_memfd (not in a page pool way)
and shared pages anonymously, is it correct to say that this is only a concern
when memory is under pressure?
Usually, fragmentation starts being a problem under memory pressure, and
memory pressure can show up simply because the page cache makes use of as
much memory as it wants.
As soon as we start allocating a 2 MB page for guest_memfd, to then split it
up + free only some parts back to the buddy (on private->shared conversion),
we create fragmentation that cannot get resolved as long as the remaining
private pages are not freed. A new conversion from shared->private on the
previously freed parts will allocate other unmovable pages (not the freed
ones) and make fragmentation worse.
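A toy model of the effect described above, as a sketch only: one 2M-aligned
block is tracked per 4K page, and the block can back a new 2M allocation only
once every page in it is free. The names (block_2m, block_free_range, ...) are
hypothetical and are not kernel APIs; real buddy-allocator and pageblock
behaviour is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512	/* 2 MiB / 4 KiB */

/* Toy model only: state of each 4K page inside one 2M-aligned block. */
enum page_state { PAGE_FREE, PAGE_PRIVATE_UNMOVABLE };

struct block_2m {
	enum page_state page[PAGES_PER_2M];
};

/* Model guest_memfd allocating the whole block as one private 2M folio. */
static void block_alloc_private(struct block_2m *b)
{
	for (size_t i = 0; i < PAGES_PER_2M; i++)
		b->page[i] = PAGE_PRIVATE_UNMOVABLE;
}

/*
 * Model a private->shared conversion without in-place conversion:
 * the folio is split and pages [start, start + n) go back to the buddy.
 */
static void block_free_range(struct block_2m *b, size_t start, size_t n)
{
	assert(start + n <= PAGES_PER_2M);
	for (size_t i = start; i < start + n; i++)
		b->page[i] = PAGE_FREE;
}

/* Count the 4K pages of the block that are currently free. */
static size_t block_free_pages(const struct block_2m *b)
{
	size_t nr = 0;

	for (size_t i = 0; i < PAGES_PER_2M; i++)
		if (b->page[i] == PAGE_FREE)
			nr++;
	return nr;
}

/* A new 2M allocation fits only if no unmovable private page remains. */
static bool block_can_serve_2m(const struct block_2m *b)
{
	return block_free_pages(b) == PAGES_PER_2M;
}
```

In this model, freeing half of the block leaves 256 free pages, yet
block_can_serve_2m() stays false until the remaining private pages are freed
as well, which is the "cannot get resolved" case.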
guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB is
still unmovable.
I previously thought fragmentation would only impact the guest by providing no
new huge pages. So if a confidential VM does not support merging small PTEs into
a huge PMD entry in its private page table, even if the new huge memory range is
physically contiguous after a private->shared->private conversion, the guest
still cannot bring back huge pages.
In-place conversion improves that quite a lot, because guest_memfd itself
will not cause unmovable fragmentation. Of course, under memory pressure,
when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
then, we already had fragmentation (and did not really cause any new one).
Makes sense.
We discussed in the upstream call, that if guest_memfd (primarily) only
allocates 2M pages and frees 2M pages, it will not cause fragmentation
itself, which is pretty nice.
Yes, not guest_memfd, in the case of non-in-place conversion.
(b) Partial discarding
For shared pages, page migration and folio split are possible for shared THP?
I assume by "shared" you mean "not guest_memfd, but some other memory we use
as an overlay" -- so no in-place conversion.
Yes, that should be possible as long as nothing else prevents
migration/split (e.g., longterm pinning)
For private pages, as you pointed out earlier, if we can ensure there are no
unexpected folio references for private memory, splitting a private huge folio
should succeed.
Yes, and maybe (hopefully) we'll reach a point where private parts will not
have a refcount at all (initially, frozen refcount, discussed during the
last upstream call).
Yes, I also tested in TDX by not acquiring folio ref count in TDX specific code
and found that partial splitting could work.
Are you concerned about the memory fragmentation after repeated
partial conversions of private pages to and from shared?
Not only repeated, even just a single partial conversion. But of course,
repeated partial conversions will make it worse (e.g., never getting a
private huge page back when there was a partial conversion).
Thanks for the explanation!
Do you think there's any chance for guest_memfd to support non-in-place
conversion first?
private pages to be THP.
Meanwhile, shared pages are not allocated from guest_memfd, and we let them be
faulted in only at 4K granularity. (specify it by a flag?)
When we want to convert a 4K page from a 2M private folio to shared, we can just
split the 2M private folio as there's no extra ref count on private pages;
when we do shared to private conversion, no split is required as shared pages
are in 4K granularity. And even if the user fails to specify the shared pages as
small pages only, the worst case is that a 2M shared folio cannot be split, and
more memory is consumed.
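The split condition in the scheme above can be sketched as a toy model. It
rests on the assumption, taken from this thread, that a private folio can be
split only when it carries no references beyond the expected ones. All names
here (toy_folio, toy_can_split, toy_convert_to_shared) are hypothetical and do
not correspond to kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy type; not struct folio from the kernel. */
struct toy_folio {
	unsigned int nr_pages;	/* 512 for a 2M folio, 1 for a 4K page */
	unsigned int refcount;	/* references currently held on the folio */
	bool is_private;
};

enum toy_result { TOY_OK, TOY_BUSY };

/* Assumption: splitting is allowed only without unexpected references. */
static bool toy_can_split(const struct toy_folio *f, unsigned int expected_refs)
{
	return f->nr_pages > 1 && f->refcount == expected_refs;
}

/*
 * Model converting one 4K page of a private folio to shared.
 * On success, f is reused to represent the resulting 4K shared page.
 */
static enum toy_result toy_convert_to_shared(struct toy_folio *f,
					     unsigned int expected_refs)
{
	assert(f->is_private);

	if (f->nr_pages > 1) {
		if (!toy_can_split(f, expected_refs))
			return TOY_BUSY;	/* extra reference: cannot split */
		f->nr_pages = 1;		/* 2M folio split down to 4K */
	}
	f->is_private = false;
	return TOY_OK;
}
```

With expected_refs == 1, a 2M private folio holding a single reference splits
and converts, while the same folio with a second (unexpected) reference is
left untouched and reports TOY_BUSY.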
Of course, memory fragmentation is still an issue as the private pages are
allocated unmovable.
conversion is ready?