Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

From: Huang, Kai
Date: Thu May 16 2024 - 09:04:45 EST


On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> >
> > I meant it seems we should just strip shared bit away from the GPA in
> > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > won't have the shared bit.
> >
> > Do you see any problem of doing so?
>
> We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().

I don't see any big difference?

Now in this patch the raw_gfn is directly from fault->addr:

raw_gfn = gpa_to_gfn(fault->addr);

tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
...
}

But there's nothing wrong to get the raw_gfn from the fault->gfn. In
fact, the zapping code just does this:

/*
* start and end doesn't have GFN shared bit. This function zaps
* a region including alias. Adjust shared bit of [start, end) if
* the root is shared.
*/
start = kvm_gfn_for_root(kvm, root, start);
end = kvm_gfn_for_root(kvm, root, end);

So there's nothing wrong to just do the same thing in both functions.

The point is fault->gfn has shared bit stripped away at the beginning, and
AFAICT there's no useful reason to keep shared bit in fault->addr. The
entire @fault is a temporary structure on the stack during fault handling
anyway.

>
> In the past I did something like the private/shared split, but for execute-only
> aliases and a few other wacky things.
>
> It also had a synthetic error code. For awhile I had it so GPA had alias bits
> (i.e. shared bit) not stripped, like TDX has today, but there was always some
> code that got surprised by the extra bits in the GPA. I want to say it was the
> emulation of PAE or something like that (execute-only had to support all the
> normal VM stuff).
>
> So in the later revisions I actually had a helper to take a GFN and PF error
> code and put the alias bits back in. Then alias bits got stripped immediately
> and at the same time the synthetic error code was set. Something similar could
> probably work to recreate "raw_gfn" from a fault.
>
> IIRC (and I could easily be wrong), when I discussed this with Sean he said TDX
> didn't need to support whatever issue I was working around, and the original
> solution was slightly better for TDX.
>
> In any case, I doubt Sean is wedded to a remark he may or may not have made long
> ago. But looking at the TDX code today, it doesn't feel that confusing to me.

[...]

>
> So I'm not against adding the shared bits back in later, but it doesn't seem
> that big of a gain to me. It also has kind of been tried before a long time ago.

As mentioned above, we are already doing that anyway in the zapping code
path.

>
> >
> > >
> > > >
> > > > >           }
> > > > >    
> > > > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > > > > index fae559559a80..8a64bcef9deb 100644
> > > > > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > > > > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > > > > @@ -91,7 +91,7 @@ struct tdp_iter {
> > > > >           tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > > > >           /* A pointer to the current SPTE */
> > > > >           tdp_ptep_t sptep;
> > > > > -       /* The lowest GFN mapped by the current SPTE */
> > > > > +       /* The lowest GFN (shared bits included) mapped by the current
> > > > > SPTE
> > > > > */
> > > > >           gfn_t gfn;
> > > >
> > > > IMHO we need more clarification of this design.
> > >

Btw, another thing after second thought:

So regardless of how to implement in KVM, IIUC TDX hardware requires below
two operations to have the shared bit set in the GPA for shared mapping:

1) Setup/teardown shared page table mapping
2) GPA range in TLB flush for shared mapping

(I kinda forgot the TLB flush part so better double check, but I guess I
am >90% sure about it.)

So in the fault handler path, we actually need to be careful of the GFN
passed to relevant functions, because for other operations like finding
memslot based on GFN, we must pass the GFN w/o shared bit.

Now the tricky thing is due to 1) the 'tdp_iter->gfn' is set to the
"raw_gfn" with shared bit in order to find the correct SPTE in the fault
handler path. And as a result, the current implementation sets the sp-
>gfn to the "raw_gfn" too.

sp = tdp_mmu_alloc_sp(vcpu);
...
tdp_mmu_init_child_sp(sp, &iter);

The problem is in current KVM implementation, iter->gfn and sp->gfn are
used in both cases: 1) page table walk and TLB flush; 2) others like
memslot lookup.

So the result is we need to be very careful whether we should strip the
shared bit away when using them.

E.g., Looking at the current dev branch, if I am reading code correctly,
it seems we have bug around here:

static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault,
struct tdp_iter *iter)
{
...

if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL,
iter->gfn, fault->pfn, iter->old_spte,
fault->prefetch, true,
fault->map_writable, &new_spte);
...
}

See @iter->gfn (which is "raw_gfn" AFAICT) is passed to both
make_mmio_spte() and make_spte(). But AFAICT both the two functions treat
GFN as the actual GFN. E.g.,

bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
bool host_writable, u64 *new_spte)
{
...

if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
...

if ((spte & PT_WRITABLE_MASK) &&
kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON_ONCE(level > PG_LEVEL_4K);
mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
}
...
}

AFAICT both @gfn in kvm_x86_get_mt_mask() and mark_page_dirty_in_slot()
needs the actual GFN. They may not be a concern for TDX now, but I think
it's logically wrong to use the raw GFN.

This kinda issue is hard to find in code writing and review. I am
thinking whether we should have a more clear way to avoid such issues.

The idea is to add a new 'raw_gfn' to @tdp_iter and 'kvm_mmu_page'. When
we walk the GFN range using iter, we always use the "actual GFN" w/o
shared bit. Like:

tdp_mmu_for_each_pte(kvm, iter, mmu, is_private, gfn, gfn + 1) {
...
}

But in the tdp_iter_*() functions, we internally calculate the "raw_gfn"
using the "actual GFN" + the 'kvm', and we use the "raw_gfn" to walk the
page table to find the correct SPTE.

So the end code will be: 1) explicitly use iter->raw_gfn for page table
walk and do TLB flush; 2) For all others like memslot lookup, use iter-
>gfn.

(sp->gfn and sp->raw_gfn can be used similarly, e.g., sp->raw_gfn is used
for TLB flush, and for others like memslot lookup we use sp->gfn.)

I think in this way the code will be more clear?