Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
From: Maxim Levitsky
Date: Wed May 26 2021 - 10:41:19 EST
On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@xxxxxxxxxx> writes:
> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > Changes since v1 (Sean):
> > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > > copy_enlightened_to_vmcs12().
> > >
> > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > > + WSL2) was crashing immediately after migration. It was also reported
> > > that we have more issues to fix as, while the failure rate was lowered
> > > significantly, it was still possible to observe crashes after several
> > > dozen migrations. Turns out, the issue arises when we manage to issue
> > > KVM_GET_NESTED_STATE right after L2->L1 VMEXIT but before L1 gets a chance
> > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> > > the flag itself is not part of saved nested state. A few other less
> > > significant issues are fixed along the way.
> > >
> > > While there's no proof this series fixes all eVMCS related problems,
> > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > crashing in testing.
> > >
> > > Patches are based on the current kvm/next tree.
> > >
> > > Vitaly Kuznetsov (7):
> > > KVM: nVMX: Introduce nested_evmcs_is_used()
> > > KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > > KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > > vmx_get_nested_state()
> > > KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > > KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > > KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > > KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > > lost
> > >
> > > arch/x86/kvm/vmx/nested.c | 110 ++++++++++++------
> > > .../testing/selftests/kvm/x86_64/evmcs_test.c | 64 +++++-----
> > > 2 files changed, 115 insertions(+), 59 deletions(-)
> > >
> > Hi Vitaly!
> > In addition to the review of this patch series,
> Thanks by the way!
> > I would like
> > to share an idea on how to avoid the hack of mapping the evmcs
> > in nested_vmx_vmexit, because I think I found a possible generic
> > solution to this and similar issues:
> > The solution is to always set nested_run_pending after
> > nested migration (which means that we won't really
> > need to migrate this flag anymore).
> > I was thinking a lot about it and I think that there is no downside to this,
> > other than sometimes one extra vmexit after migration.
> > Otherwise there is always a risk of the following scenario:
> > 1. We migrate with nested_run_pending=0 (but don't restore all the state
> > yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
> > or just the guest memory map is not up to date, guest is in smm or something
> > like that)
> > 2. Userspace calls some ioctl that causes a nested vmexit
> > This can happen today if the userspace calls
> > kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > 3. Userspace finally sets correct guest's msrs, correct guest memory map and only
> > then calls KVM_RUN
> > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > but we have to do so to complete the nested vmexit.
> Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
> only one place which calls copy_vmcs12_to_enlightened():
> nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
> vmx_prepare_switch_to_guest() so unless userspace decided to execute
> not-fully-restored guest this should not happen. I'm probably missing
> something in your scenario)
You are right!
The evmcs write is delayed to the next vmentry.
However, we are now mapping the evmcs during nested vmexit,
and this can fail, for example if the HV assist msr is not up to date.
For example consider this:
1. Userspace first sets nested state
2. Userspace calls KVM_GET_MP_STATE.
3. The nested vmexit that happens in step 2 ends up not being able to map the evmcs,
since the HV_ASSIST msr is not yet loaded.
Also the vmcb write (that is for SVM) _is_ done right away on nested vmexit
and conceptually has the same issue.
(if memory map is not up to date, we might not be able to read/write the
vmcb12 on nested vmexit)
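
To make the ordering problem concrete, here is a toy userspace model of the
three steps above. It is only a sketch and not KVM code: the struct, its
fields and all toy_* functions are made up for illustration, and the single
boolean stands in for the HV assist msr being loaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in KVM. */
struct toy_vcpu {
	bool guest_mode;        /* nested state says L2 is active */
	bool assist_msr_loaded; /* stands in for the HV assist msr */
	bool event_pending;     /* something that forces a nested vmexit */
	bool evmcs_mapped;
};

/* Step 1: userspace sets nested state; the msrs are not restored yet. */
static void toy_set_nested_state(struct toy_vcpu *v)
{
	assert(v);
	v->guest_mode = true;
	v->evmcs_mapped = false;
}

/* Later step: userspace restores the msrs. */
static void toy_set_msrs(struct toy_vcpu *v)
{
	assert(v);
	v->assist_msr_loaded = true;
}

/* In this model the eVMCS can only be located once the msr is loaded. */
static bool toy_map_evmcs(struct toy_vcpu *v)
{
	if (!v->assist_msr_loaded)
		return false;
	v->evmcs_mapped = true;
	return true;
}

/*
 * Step 2: an ioctl in the style of KVM_GET_MP_STATE processes pending
 * events and may do a nested vmexit. Returns false when that vmexit
 * could not map the eVMCS.
 */
static bool toy_get_mp_state(struct toy_vcpu *v)
{
	bool ok = true;

	assert(v);
	if (v->guest_mode && v->event_pending) {
		ok = toy_map_evmcs(v);
		v->guest_mode = false;
		v->event_pending = false;
	}
	return ok;
}
```

In this model, calling toy_get_mp_state() between toy_set_nested_state() and
toy_set_msrs() fails to map the eVMCS, while the same call after
toy_set_msrs() succeeds.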
> > To some extent, the entry to the nested mode after a migration is only complete
> > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shouldn't interrupt it.
> > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> > nested vmexit path at all.
> Remember, we have three possible states when nested state is
> 1) L2 was running
> 2) L1 was running
> 3) We're in between L2 and L1 (need_vmcs12_to_shadow_sync = true).
I understand. This suggestion wasn't meant to fix case 3, but rather to fix
case 1, where we are in L2, migrate, and then immediately decide to
do a nested vmexit before we have processed the KVM_REQ_GET_NESTED_STATE_PAGES
request, and potentially also before the guest state was fully uploaded
(see that KVM_GET_MP_STATE thing).
In a nutshell, I vote for not allowing nested vmexits from the moment
when we set the nested state and until the moment we enter the nested
guest once (maybe with request for immediate vmexit),
because during this time period, the guest state is not fully consistent.
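
Roughly, the behaviour I have in mind could be modelled like this. Again this
is a toy userspace sketch and not a patch: the struct and the sketch_*
functions are hypothetical, and only the nested_run_pending name is borrowed
from the real code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: field and function names are hypothetical, not KVM's. */
struct sketch_vcpu {
	bool guest_mode;         /* L2 is active */
	bool nested_run_pending; /* nested entry not completed yet */
	bool state_restored;     /* msrs, memory map etc. are in place */
	bool event_pending;      /* an event that wants a nested vmexit */
};

/* Setting nested state always marks the nested entry as pending. */
static void sketch_set_nested_state(struct sketch_vcpu *v)
{
	assert(v);
	v->guest_mode = true;
	v->nested_run_pending = true;
}

/* Returns true if a nested vmexit was performed. */
static bool sketch_check_nested_events(struct sketch_vcpu *v)
{
	assert(v);
	if (!v->guest_mode || !v->event_pending)
		return false;
	if (v->nested_run_pending)
		return false; /* defer: the event stays pending */
	v->guest_mode = false;
	v->event_pending = false;
	return true;
}

/*
 * First run after migration: the nested entry completes only once the
 * state is fully restored, and only then are nested vmexits allowed.
 */
static bool sketch_run(struct sketch_vcpu *v)
{
	assert(v);
	if (!v->state_restored)
		return false;
	v->nested_run_pending = false;
	return true;
}
```

With this, an event that arrives before the first run is left pending and is
only turned into a nested vmexit after sketch_run() has completed the entry.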
> Is 'nested_run_pending' suitable for all of them? Could you maybe draft
> a patch so we can see how this works (in both 'normal' and 'evmcs'