Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()

From: Huang, Kai

Date: Mon Jan 19 2026 - 05:40:56 EST


On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > > the guts of the TDP MMU.
> > > > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > > > and tail pages is asinine.
> > > > That's a reasonable concern. I actually thought about it.
> > > > My consideration was as follows:
> > > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > > less than 1GB. Though the initial conversion which converts all memory from
> > > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > > the traversal should be very fast (since the traversal doesn't even need to go
> > > > down to the 2MB/1GB level).
> > > >
> > > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > > very large range at runtime, it can optimize by invoking the API twice:
> > > > once for range [start, ALIGN(start, 1GB)), and
> > > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > >
> > > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > > by checking the range size if you think that would be better.
> > >
> > > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > > want to do optimization.
> > >
> > > I think I raised this in v2, and asked why not just let the caller
> > > figure out the ranges to split for a given range (see at the end of
> > > [*]), because the "cross boundary" can only happen at the beginning and
> > > end of the given range, if possible.
> Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> start is 1GB-aligned, then there's no need to split for start. However, if start
> is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
> start - 1 and start.

Why does the caller need to know?

Let's only talk about 'start' for simplicity:

- If start is 1G aligned, then no split is needed.

- If start is not 1G-aligned but 2M-aligned, you split the range:

[ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.

- If start is 4K-aligned only, you first split

[ALIGN_DOWN(start, 1G), ALIGN(start, 1G))

to 2M level, then you split

[ALIGN_DOWN(start, 2M), ALIGN(start, 2M))

to 4K level.

The same handling applies to 'end'. One additional point: if a
to-be-split range calculated from 'start' overlaps one calculated from
'end', the split is only needed once.

Wouldn't this work?
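
To make the calculation above concrete, here is a rough userspace sketch
of how a caller could derive the ranges. All names in it (struct
split_range, add_boundary_splits(), calc_split_ranges()) are made up for
illustration and are not existing KVM code; the alignment macros only
mimic the kernel ones. It assumes the only mapping sizes are 4K, 2M and
1G.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Power-of-two alignment helpers, modeled on the kernel macros. */
#define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))
#define IS_ALIGNED(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

/* Hypothetical descriptor: split mappings in [start, end) down to 'target'. */
struct split_range {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	uint64_t target;	/* SZ_2M or SZ_4K */
};

/* Same-level ranges are either identical or disjoint, so compare exactly. */
static int has_range(const struct split_range *out, int n, struct split_range r)
{
	for (int i = 0; i < n; i++)
		if (out[i].start == r.start && out[i].target == r.target)
			return 1;
	return 0;
}

/* Append the ranges needed so that no huge mapping straddles 'addr'. */
static int add_boundary_splits(uint64_t addr, struct split_range *out, int n)
{
	struct split_range r;

	if (IS_ALIGNED(addr, SZ_1G))
		return n;		/* 1G-aligned: nothing to split */

	/* Not 1G-aligned: split the enclosing 1G region to 2M level. */
	r = (struct split_range){ ALIGN_DOWN(addr, SZ_1G),
				  ALIGN_DOWN(addr, SZ_1G) + SZ_1G, SZ_2M };
	if (!has_range(out, n, r))
		out[n++] = r;

	if (IS_ALIGNED(addr, SZ_2M))
		return n;

	/* Only 4K-aligned: also split the enclosing 2M region to 4K level. */
	r = (struct split_range){ ALIGN_DOWN(addr, SZ_2M),
				  ALIGN_DOWN(addr, SZ_2M) + SZ_2M, SZ_4K };
	if (!has_range(out, n, r))
		out[n++] = r;

	return n;
}

/* Fill 'out' (room for 4 entries) for [start, end); return the count. */
static int calc_split_ranges(uint64_t start, uint64_t end,
			     struct split_range *out)
{
	int n = 0;

	assert(start < end);
	n = add_boundary_splits(start, out, n);
	n = add_boundary_splits(end, out, n);	/* duplicates are skipped */
	return n;
}
```

The caller would then issue one split per returned entry, in order, so
at most four small ranges are walked no matter how large [start, end)
is.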

> (for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> exist a 1GB mapping covering start -1 and start).
>
> In my reply to [*], I didn't want to do the calculation because I didn't see
> much overhead from always invoking tdp_mmu_split_huge_pages_root().
> But the scenario Sean pointed out is different. When both start and end are not
> 2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
> reduce the iterations in tdp_mmu_split_huge_pages_root().

I don't see much difference. Maybe I am missing something.

>
> Opportunistically, optimization to skip splits for 1GB-aligned start or end is
> possible :)

If this makes code easier to review/maintain then sure.

As long as the solution is easy to review (i.e., not too complicated to
understand/maintain) then I am fine with whatever Sean/you prefer.

However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
first saw this :-)