Re: [RFC PATCH v2 41/69] KVM: x86: Add infrastructure for stolen GPA bits
From: Sean Christopherson
Date: Fri Aug 06 2021 - 15:02:48 EST
On Fri, Aug 06, 2021, Kai Huang wrote:
> On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@xxxxxxxxx wrote:
> > > > From: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > sp = kvm_mmu_alloc_page(vcpu, direct);
> > > >
> > > > sp->gfn = gfn;
> > > > + sp->gfn_stolen_bits = gfn_stolen_bits;
> > > > sp->role = role;
> > > > hlist_add_head(&sp->hash_link, sp_list);
> > > > if (!direct) {
> > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > return sp;
> > > > }
> > >
> > >
> > > Sorry for replying old thread,
> >
> > Ha, one month isn't old, it's barely even mature.
> >
> > > but to me it looks weird to have gfn_stolen_bits
> > > in 'struct kvm_mmu_page'. If I understand correctly, above code basically
> > > means that GFN with different stolen bit will have different 'struct
> > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > stolen bits still use the same root,
> >
> > You're conflating "mapping" with "PTE". The GFN is a per-PTE value. Yes, there
> > is a final GFN that is representative of the mapping, but more directly the final
> > GFN is associated with the leaf PTE.
> >
> > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > when talking about stolen bits (the shared bit), but in the context of this patch,
> > the stolen bits are a property of the PTE.
>
> Yes it is a property of PTE, this is the reason that I think it's weird to have
> stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
> kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
> stolen bit?
No, the stolen bits are the property of the shadow page. I'm using "PTE" above
to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
sense.
> > Back to your statement, it's incorrect. PTEs (effectively mappings in TDX) with
> > different stolen bits will _not_ use the same root. kvm_mmu_get_page() includes
> > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > stolen bits when looking for an existing shadow page at the target GFN.
> >
> > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > role.quadrant = quadrant;
> > }
> >
> > - sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > + sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> > for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > - if (sp->gfn != gfn) {
> > + if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> > collisions++;
> > continue;
> > }
> >
>
> This only works for non-root table, but there's only one single
> vcpu->arch.mmu->root_hpa, we don't have an array to have one root for each
> stolen bit, i.e. do a loop in mmu_alloc_direct_roots(), so effectively all
> stolen bits share one single root.
Yes, and that's absolutely the required behavior for everything except for TDX
with its two EPTPs. E.g. any other implement _must_ reject CR3s that set stolen
gfn bits.
> > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > page table.
> >
> > It does make sense, even without a follow-up patch. In Rick's original series,
> > stealing a bit for execute-only guest memory, there was only a single root. And
> > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > usable, i.e. there's only the regular/private EPTP.
> >
> > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > the context of TDX, since TDX requires two separate roots for private and
> > > shared mappings.
> >
> > > So given we cannot tell whether the same root, or different roots should be
> > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > for the same GFN, but should use a new role (i.e role.private) to determine.
> >
> > A new role would work, too, but it has the disadvantage of not automagically
> > working for all uses of stolen bits, e.g. XO support would have to add another
> > role bit.
>
> For each purpose of particular stolen bit, a new role can be defined. For
> instance, in __direct_map(), if you see stolen bit is TDX shared bit, you don't
> set role.private (otherwise set role.private). For XO, if you see the stolen
> bit is XO, you set role.xo.
>
> We already have info of 'gfn_stolen_mask' in vcpu, so we just need to make sure
> all code paths can find the actual stolen bit based on sp->role and vcpu (I
> haven't gone through all though, assuming the annoying part is rmap).
Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
yet... The idea of a 'private' flag is growing on me.
If we're treating the shared bit as an attribute bit, which we are, then it's
effectively an extension of role.access. Ditto for XO.
And looking at the code, I think it would be an improvement for TDX, as all of
the is_private_gfn() calls that operate on a shadow page would be simplified and
optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
mask of the shared bit.
Actually, the more I think about it, the more I like it. For TDX, there's no
risk of increased hash collisions, as we've already done messed up if there's a
shared vs. private collision.
And for XO, if it ever makes it way upstream, I think we should flat out disallow
referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
in non-leaf PTEs. That would avoid any theoretical problems with the guest doing
something stupid by polluting all its upper level PxEs with XO. Collisions would
be purely limited to the case where the guest is intentionally creating an alternate
mapping, which should be a rare event (or the guest is comprosied, which is also
hopefully a rare event).