Re: [PATCH 3/3] KVM: LAPIC: Optimize timer latency further

From: Wanpeng Li
Date: Tue May 14 2019 - 06:56:32 EST


On Tue, 14 May 2019 at 09:45, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>
> On Tue, 14 May 2019 at 03:54, Sean Christopherson
> <sean.j.christopherson@xxxxxxxxx> wrote:
> >
> > On Thu, May 09, 2019 at 07:29:21PM +0800, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> > >
> > > Advance lapic timer tries to hidden the hypervisor overhead between host
> > > timer fires and the guest awares the timer is fired. However, it just hidden
> > > the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire,
> > > instead of the real position of vmentry which is mentioned in the orignial
> > > commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer
> > > expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire
> > > and before world switch on my haswell desktop, it will be 2400+ cycles if
> > > vmentry_l1d_flush is tuned to always.
> > >
> > > This patch tries to narrow the last gap, it measures the time between
> > > the end of wait_lapic_expire and before world switch, we take this
> > > time into consideration when busy waiting, otherwise, the guest still
> > > awares the latency between wait_lapic_expire and world switch, we also
> > > consider this when adaptively tuning the timer advancement. The patch
> > > can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell
> > > desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.
> > >
> > > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > > Cc: Radim KrÄmÃÅ <rkrcmar@xxxxxxxxxx>
> > > Cc: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > > Cc: Liran Alon <liran.alon@xxxxxxxxxx>
> > > Signed-off-by: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> > > ---
> > > arch/x86/kvm/lapic.c | 23 +++++++++++++++++++++--
> > > arch/x86/kvm/lapic.h | 8 ++++++++
> > > arch/x86/kvm/vmx/vmx.c | 2 ++
> > > 3 files changed, 31 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index e7a0660..01d3a87 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -1545,13 +1545,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > >
> > > tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > > apic->lapic_timer.expired_tscdeadline = 0;
> > > - guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > > + guest_tsc = kvm_read_l1_tsc(vcpu, (apic->lapic_timer.measure_delay_done == 2) ?
> > > + rdtsc() + apic->lapic_timer.vmentry_delay : rdtsc());
> > > trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> > >
> > > if (guest_tsc < tsc_deadline)
> > > __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> > >
> > > adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
> > > +
> > > + if (!apic->lapic_timer.measure_delay_done) {
> > > + apic->lapic_timer.measure_delay_done = 1;
> > > + apic->lapic_timer.vmentry_delay = rdtsc();
> > > + }
> > > }
> > >
> > > static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > > @@ -1837,6 +1843,18 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> > > }
> > > }
> > >
> > > +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
> >
> > This will #GP if the APIC is not in-kernel, i.e. @apic is NULL.
> >
> > > +
> > > + if (ktimer->measure_delay_done == 1) {
> > > + ktimer->vmentry_delay = rdtsc() -
> > > + ktimer->vmentry_delay;
> > > + ktimer->measure_delay_done = 2;
> >
> > Measuring the delay a single time is bound to result in random outliers,
> > e.g. if an NMI happens to occur after wait_lapic_expire().
> >
> > Rather than reinvent the wheel, can we simply move the call to
> > wait_lapic_expire() into vmx.c and svm.c? For VMX we'd probably want to
> > support the advancement if enable_unrestricted_guest=true so that we avoid
> > the emulation_required case, but other than that I don't see anything that
> > requires wait_lapic_expire() to be called where it is.
>
> I also considered to move wait_lapic_expire() into vmx.c and svm.c
> before, what do you think, Paolo, Radim?

However, guest_enter_irqoff() also prevents this. Otherwise, we will
account busy wait time as guest time. How about sampling several times
and get the average value or conservative min value to handle Sean's
concern?

Regards,
Wanpeng Li