Re: [PATCH 0/2] KVM: x86: Fix and cleanup for recent AVIC changes
From: Maxim Levitsky
Date:  Tue Oct 12 2021 - 05:53:18 EST
On Mon, 2021-10-11 at 16:58 +0000, Sean Christopherson wrote:
> On Mon, Oct 11, 2021, Sean Christopherson wrote:
> > On Sun, Oct 10, 2021, Maxim Levitsky wrote:
> > > On Fri, 2021-10-08 at 18:01 -0700, Sean Christopherson wrote:
> > > > Belated "code review" for Maxim's recent series to rework the AVIC inhibit
> > > > code.  Using the global APICv status in the page fault path is wrong as
> > > > the correct status is always the vCPU's, since that status is accurate
> > > > with respect to the time of the page fault.  In a similar vein, the code
> > > > to change the inhibit can be cleaned up since KVM can't rely on ordering
> > > > between the update and the request for anything except consumers of the
> > > > request.
> > > > 
> > > > Sean Christopherson (2):
> > > >   KVM: x86/mmu: Use vCPU's APICv status when handling APIC_ACCESS
> > > >     memslot
> > > >   KVM: x86: Simplify APICv update request logic
> > > > 
> > > >  arch/x86/kvm/mmu/mmu.c |  2 +-
> > > >  arch/x86/kvm/x86.c     | 16 +++++++---------
> > > >  2 files changed, 8 insertions(+), 10 deletions(-)
> > > > 
> > > 
> > > Are you sure about it? Let me explain how the algorithm works:
> > > 
> > > - kvm_request_apicv_update:
> > > 
> > > 	- take kvm->arch.apicv_update_lock
> > > 
> > > 	- if inhibition state doesn't really change (kvm->arch.apicv_inhibit_reasons still zero or non zero)
> > > 		- update kvm->arch.apicv_inhibit_reasons
> > > 		- release the lock
> > > 
> > > 	- raise KVM_REQ_APICV_UPDATE
> > > 		* since kvm->arch.apicv_update_lock is taken, all vCPUs will be
> > > 		kicked out of guest mode and will be either doing someing in
> > > 		the KVM (like page fault) or stuck on trying to process that
> > > 		request the important thing is that no vCPU will be able to get
> > > 		back to the guest mode.
> > > 
> > > 	- update the kvm->arch.apicv_inhibit_reasons
> > > 		* since we hold vm->arch.apicv_update_lock vcpus can't see the new value
> > 
> > This assertion is incorrect, kvm_apicv_activated() is not guarded by the lock.
> > 
> > > 	- update the SPTE that covers the APIC's mmio window:
> > 
> > This won't affect in-flight page faults.
> > 
> > 
> >    vCPU0                               vCPU1
> >    =====                               =====
> >    Disabled APICv
> >    #NPT                                Acquire apicv_update_lock
> >                                        Re-enable APICv
> >    kvm_apicv_activated() == false
> 
> Doh, that's supposed to be "true".
> 
> >    incorrectly handle as regular MMIO
> >                                        zap APIC pages
> >    MMIO cache has bad entry
> 
> Argh, I forgot the memslot is still there, so the access won't be treated as MMIO
> and thus won't end up in the MMIO cache.
> 
> So I agree that the code is functionally ok, but I'd still prefer to switch to
> kvm_vcpu_apicv_active() so that this code is coherent with respect to the APICv
> status at the time the fault occurred.
> 
> My objection to using kvm_apicv_activated() is that the result is completely
> non-deterministic with respect to the vCPU's APICv status at the time of the
> fault.  It works because all of the other mechanisms that are in place, e.g.
> elevating the MMU notifier count, but the fact that the result is non-deterministic
> means that using the per-vCPU status is also functionally ok.
The problem is that it is just not correct to use local AVIC enable state 
to determine if we want to populate the SPTE or or just jump to the emulation.
For example, assuming that the AVIC is now enabled on all vCPUs,
we can have this scenario:
    vCPU0                                   vCPU1
    =====                                   =====
- disable AVIC
- VMRUN
                                        - #NPT on AVIC MMIO access
                                        - *stuck on something prior to the page fault code*
- enable AVIC
- VMRUN
                                        - *still stuck on something prior to the page fault code*
- disable AVIC:
  - raise KVM_REQ_APICV_UPDATE request
					
  - set global avic state to disable
  - zap the SPTE (does nothing, doesn't race
	with anything either)
  - handle KVM_REQ_APICV_UPDATE -
    - disable vCPU0 AVIC
- VMRUN
					- *still stuck on something prior to the page fault code*
                                                            ...
                                                            ...
                                                            ...
                                        - now vCPU1 finally starts running the page fault code.
                                        - vCPU1 AVIC is still enabled 
                                          (because vCPU1 never handled KVM_REQ_APICV_UPDATE),
                                          so the page fault code will populate the SPTE.
                                          
                                        - handle KVM_REQ_APICV_UPDATE
                                           - finally disable vCPU1 AVIC
                                        - VMRUN (vCPU1 AVIC disabled, SPTE populated)
					                 ***boom***
> 
> At a minimum, I'd like to add a blurb in the kvm_faultin_pfn() comment to call out
> the reliance on mmu_notifier_seq.
This is a very good idea!
> 
> E.g. if kvm_zap_gfn_range() wins the race to acquire mmu_lock() after APICv is
> inhibited/disabled by __kvm_request_apicv_update(), then direct_page_fault() will
> retry the fault due to the change in mmu_notifier_seq.  If direct_page_fault()
> wins the race, then kvm_zap_gfn_range() will zap the freshly-installed SPTEs.
> For the uninhibit/enable case, at worst KVM will emulate an access that could have
> been accelerated by retrying the instruction.
Yes, 100% agree. 
The thing was super tricky to implement to avoid races that happen otherwise
this way or another.
Best regards,
	Maxim Levitsky
>