Re: [PATCH] KVM: SVM: Use a separate vmcb for the nested L2 guest

From: Cathy Avery
Date: Thu Oct 08 2020 - 09:52:39 EST


On 10/8/20 9:11 AM, Maxim Levitsky wrote:
On Thu, 2020-10-08 at 08:46 -0400, Cathy Avery wrote:
On 10/8/20 6:54 AM, Maxim Levitsky wrote:
On Thu, 2020-10-08 at 13:39 +0300, Maxim Levitsky wrote:
On Thu, 2020-10-08 at 13:23 +0300, Maxim Levitsky wrote:
On Thu, 2020-10-08 at 07:52 +0200, Paolo Bonzini wrote:
On 08/10/20 00:14, Maxim Levitsky wrote:
+ if (svm->vmcb01->control.asid == 0)
+ svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
I think that the above should be done always. The asid field is currently host
controlled only (that is L2 value is ignored, selective ASID tlb flush is not
advertized to the guest and lnvlpga is emulated as invlpg).
Yes, in fact I suggested that ASID should be in svm->asid and moved to
svm->vmcb->asid in svm_vcpu_run. Then there's no need to special case
it in nested code.
This makes lot of sense!
This should be a patch coming before this one.

1. Something wrong with memory types - like guest is using UC memory for everything.
I can't completely rule that out yet
You can print g_pat and see if it is all zeroes.
I don't even need to print it. I know that it is never set anywhere, unless guest writes it,
but now that I look at it, we set it to a default value and there is no code to set it to
default value for vmcb02. This is it. now my fedora guest boots just fine!

I played a lot with g_pat, and yet this didn't occur to me . I was that close :-(
I knew that it has to be something with memory types, but it never occured to me
that guest just doesn't write IA32_PAT and uses our value which we set in init_vmcb


In general I think it's better to be explicit with vmcb01 vs. vmcb02,
like Cathy did, but I can see it's a matter of personal preference to
some extent.
I also think so in general, but in the code that is outside 'is_guest_mode'
IMHO it is better to refer to vmcb01 as vmcb, although now that I think of
it, its not wrong to do so either.


My windows hyper-v guest doesn't boot though and I know why.

As we know the vmcb save area has extra state which vmrun/vmexit don't touch.
Now suppose a nested hypervisor wants to enter a nested guest.

It will do vmload, which will load the extra state from the nested vmcb (vmcb12
or as I woudl say the vmcb that nested hypervisor thinks that it is using),
to the CPU. This can cause some vmexits I think, but this doesn't matter much.

Now the nested hypervisor does vmrun. The extra state of L2 guest is in CPU registers,
and it is untouched. We do vmsave on vmcb01 to preserve that state, but later
when we do vmload on vmcb02 prior to vmenter on it, which loads stale state from it.
The same issue happens the other way around on nested vmexit.

I fixed this by doing nested_svm_vmloadsave, but that should be probably be
optimized with dirty bits. Now though I guess the goal it to make
it work first.

With this fixed HyperV boots fine, and even passes the 'works' test of booting
the windows 10 with hyperv enabled nested itself and starting the vm inside,
which makes that VM L3 (in addition to windows itself that runs as L3 in relation to hyper-v)

https://i.imgur.com/sRYqtVV.png

In summary this is the diff of fixes (just pasted to email, probably mangled):


diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 0a06e62010d8c..7293ba23b3cbc 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -436,6 +436,9 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
WARN_ON(svm->vmcb == svm->nested.vmcb02);
svm->nested.vmcb02->control = svm->vmcb01->control;
+
+ nested_svm_vmloadsave(svm->vmcb01, svm->nested.vmcb02);
+
svm->vmcb = svm->nested.vmcb02;
svm->vmcb_pa = svm->nested.vmcb02_pa;
load_nested_vmcb_control(svm, &nested_vmcb->control);
@@ -622,6 +625,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (svm->vmcb01->control.asid == 0)
svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
+ nested_svm_vmloadsave(svm->nested.vmcb02, svm->vmcb01);
svm->vmcb = svm->vmcb01;
svm->vmcb_pa = svm->nested.vmcb01_pa;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b66239b26885d..ee9f87fe611f2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1097,6 +1097,7 @@ static void init_vmcb(struct vcpu_svm *svm)
clr_cr_intercept(svm, INTERCEPT_CR3_READ);
clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
save->g_pat = svm->vcpu.arch.pat;
+ svm->nested.vmcb02->save.g_pat = svm->vcpu.arch.pat;
I had noticed that g_pat was not set in vmcb02 and set it to vmcb01's
value which was not helpful but I did not try the current vcpu value.

I am getting a #UD which I suspected had something to do with cr3 but
I'll know more after I add your suggestions.

emu-system-x86-1647 [033] .... 3167.589402: kvm_nested_vmexit_inject:
reason: UD excp ext_inf1: 0x0000000000000000 ext_inf2:
0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000


save->cr3 = 0;
save->cr4 = 0;
}



Best regards,
Maxim Levitsky

Paolo

And another thing I spotted before I forget.

If we setup a tlb flush in ctl.tlb_ctl of vmcb01, just prior to nested vmentry
then this field will be copied to vmcb02 but on next vmexit we clear it in current
(that is vmcb02) and that change will not propogate to vmcb01.
ctl.tlb_ctl is dependent on the value of save.cr4 which was not being
set in vmcb02.
Not sure I understand. Could you explain?

The vmcb02.save.cr4 is set in nested_prepare_vmcb_save, which does
svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4),
which eventually does 'to_svm(vcpu)->vmcb->save.cr4 = cr4;'
And in this point vmcb points to vmcb02.

Yes it points to vmcb02 but

int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
        unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE;
        unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;  <--- this vmcb02 was never initialized and will always be zero. Where previously it would have contained L1.save.cr4 and tbl_ctl is set to something other than TLB_CONTROL_DO_NOTHING. I saw this in my traces.


Besides SEV, it seems like ctl.tlb_ctl is set by svm_flush_tlb,
which is indeed called from svm_set_cr4, but it is also exposed
but .tlb_flush* callbacks from KVM core.

What I meant is that we can create vcpu->arch->needs_tlb_flush, set it in svm_flush_tlb instead of tlb_ctl,
and then svm_vcpu_run can set tlb_ctl in the current ctl to that TLB_CONTROL_FLUSH_ASID when
vcpu->arch->needs_tlb_flush is set.

Best regards,
Maxim Levitsky


I am not sure if this is a theorerical issue or not. We probably should apply the same treatment to
it as what Paulo suggested to do with asid -
set it just prior to vmentry if tlb flush is needed, and clear it afterwards as we do.
And yet another thing to note is that we curently ignore L2's g_pat. However it _seems_ that we practically
ignore L1 PAT as well in regard to shadowing NPT mmu. I am not 100% sure about this.
I'll dig that area eventually.

Best regards,
Maxim Levitsky

Best regards,
Maxim Levitsky
Thanks,

Cathy