Re: [PATCH v2] KVM: LAPIC: Periodically revaluate to get conservative lapic_timer_advance_ns

From: Wanpeng Li
Date: Wed Aug 28 2019 - 04:09:24 EST


On Wed, 28 Aug 2019 at 01:47, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
>
> 2019-08-15 12:03+0800, Wanpeng Li:
> > From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> >
> > Even for realtime CPUs, cache line bounces, frequency scaling, the presence
> > of higher-priority RT tasks, etc. can still cause the response time to vary.
> > These interferences should be taken into account by periodically re-evaluating
> > whether the current lapic_timer_advance_ns value is still the best: do nothing
> > if it is, otherwise recalculate. Set lapic_timer_advance_ns to the minimal
> > conservative value among all the estimated values.
>
> IIUC, this patch is looking for the minimal timer_advance_ns because it
> provides the best throughput:
> When every code path runs as fast as possible, we don't have to wait
> for the timer to expire, yet we still arrive exactly at the point when it
> would have expired.
> We always arrive late if something delayed the execution, which
> means higher latencies, but RT shouldn't be using an adaptive algorithm
> anyway, so that is not an issue.
>
> The computed conservative timer_advance_ns will converge to the minimal
> measured timer_advance_ns as time progresses, because it can only go
> down and will do so repeatedly by small steps as even one smaller
> measurement sufficiently close is enough to decrease it.
>
> With that in mind, wouldn't the following patch (completely untested)
> give you about the same numbers?
>
> The idea is that if we have to wait, we are wasting time and therefore
> decrease timer_advance_ns to eliminate the time spent waiting.
>
> The first run is special and just sets timer_advance_ns to the latency
> we measured, regardless of what it is -- it deals with the possibility
> that the default was too low.
>
> This algorithm is also likely prone to turbo boost making a few runs
> faster than what is then achieved during a more common workload, but
> we'd need to have a sliding window or some other more sophisticated
> algorithm in order to deal with that.
>
> I also like Paolo's idea of smoothing -- if we use a moving average
> based on advance_expire_delta, we wouldn't even have to convert it into
> ns unless it reached a given threshold, which could make it decently fast
> to run every time.
>
> Something like
>
> moving_average = (apic->lapic_timer.moving_average * (weight - 1) + advance_expire_delta) / weight
>
> if (moving_average > threshold)
> recompute timer_advance_ns
>
> apic->lapic_timer.moving_average = moving_average
>
> where weight would be a power of 2 to make the operation fast.
>
> This kind of moving average gives less value to old inputs and the
> weight allows us to control the reaction speed of the approximation.
> (A small number like 4 or 8 seems about right.)
>
> I don't have any information on the latency, though.
> Do you think that the added overhead isn't worth the smoothing?
>
> Thanks.
>
> ---8<---
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e904ff06a83d..d7f2af2eb3ce 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1491,23 +1491,20 @@ static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
> if (advance_expire_delta < 0) {
> ns = -advance_expire_delta * 1000000ULL;
> do_div(ns, vcpu->arch.virtual_tsc_khz);
> - timer_advance_ns -= min((u32)ns,
> - timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> + timer_advance_ns -= (u32)ns;
> } else {
> /* too late */
> + /* This branch can only be taken on the initial calibration. */
> + if (apic->lapic_timer.timer_advance_adjust_done)
> + pr_err_once("kvm: broken expectation in lapic timer_advance_ns");
> +
> ns = advance_expire_delta * 1000000ULL;
> do_div(ns, vcpu->arch.virtual_tsc_khz);
> - timer_advance_ns += min((u32)ns,
> - timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> + timer_advance_ns += (u32)ns;
> }
>
> - if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> - apic->lapic_timer.timer_advance_adjust_done = true;
> - if (unlikely(timer_advance_ns > 5000)) {
> - timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT;
> - apic->lapic_timer.timer_advance_adjust_done = false;
> - }
> apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> + apic->lapic_timer.timer_advance_adjust_done = true;
> }
>
> static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> @@ -1526,7 +1523,7 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> if (guest_tsc < tsc_deadline)
> __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>
> - if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
> + if (unlikely(!apic->lapic_timer.timer_advance_adjust_done) || guest_tsc < tsc_deadline)
> adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
> }
>
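
(For reference, a quick stand-alone sketch of the power-of-2 weighted
moving average you describe; the weight, the threshold and the names
below are made up for illustration, they are not the KVM code.)

#include <stdio.h>
#include <stdlib.h>

#define WEIGHT     8    /* power of 2, so the divide is just a shift */
#define THRESHOLD  100  /* guest TSC cycles before we would recompute */

static long long moving_average;   /* smoothed advance_expire_delta */

/* Returns 1 when the smoothed error warrants recomputing timer_advance_ns. */
static int update_average(long long advance_expire_delta)
{
	moving_average = (moving_average * (WEIGHT - 1) +
			  advance_expire_delta) / WEIGHT;
	return llabs(moving_average) > THRESHOLD;
}

int main(void)
{
	/* a steady ~200-cycle drift pulls the average past the
	 * threshold only after several samples */
	long long samples[] = { -200, -180, -220, -190, -210, -205 };
	int i;

	for (i = 0; i < 6; i++) {
		int recompute = update_average(samples[i]);

		printf("delta=%5lld avg=%5lld recompute=%d\n",
		       samples[i], moving_average, recompute);
	}
	return 0;
}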

Something like the patch below the scissors line; the result is not as
good as we expected. How about this for v3? It uses a moving average for
smoothing and filters out the drastic fluctuations that defeated it
before.

(testing on a Skylake server; lapic_timer_advance_ns is around
1500ns with my v2 patch)
# cat vcpu*/lapic_timer_advance_ns
483
263
211
15
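
(Side note on the units: advance_expire_delta is in guest TSC cycles, and
the adjustment converts it to ns as cycles * 1000000 / virtual_tsc_khz.
The 2.5GHz TSC and the 3750-cycle delta below are assumed example values,
picked only to show how such a delta maps to the ~1500ns figure above;
they are not measurements.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t virtual_tsc_khz = 2500000;    /* assumed 2.5 GHz guest TSC */
	int64_t advance_expire_delta = -3750;  /* assumed: fired 3750 cycles early */

	/* same arithmetic as the adjustment code: cycles * 1e6 / tsc_khz */
	uint64_t ns = (uint64_t)(-advance_expire_delta) * 1000000ULL /
		      virtual_tsc_khz;

	printf("%lld cycles early == %llu ns\n",
	       (long long)-advance_expire_delta, (unsigned long long)ns);
	return 0;
}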

---8<---
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e904ff0..bdc0702 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1485,29 +1485,50 @@ static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
{
struct kvm_lapic *apic = vcpu->arch.apic;
u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
+ s64 moving_average;
u64 ns;

+ if (abs(advance_expire_delta) > 10000)
+ /* filter out drastic fluctuations */
+ return;
+
+ if (apic->lapic_timer.timer_advance_adjust_done) {
+ moving_average = (apic->lapic_timer.moving_average *
+ (LAPIC_TIMER_ADVANCE_ADJUST_STEP - 1) +
+ advance_expire_delta) / LAPIC_TIMER_ADVANCE_ADJUST_STEP;
+
+ if (abs(moving_average) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
+ /* no update for random fluctuations */
+ return;
+
+ apic->lapic_timer.moving_average = moving_average;
+ advance_expire_delta = moving_average;
+ } else
+ apic->lapic_timer.moving_average = advance_expire_delta;
+
/* too early */
if (advance_expire_delta < 0) {
ns = -advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
- timer_advance_ns -= min((u32)ns,
- timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+ timer_advance_ns -= (u32)ns;
} else {
/* too late */
+ /* This branch can only be taken on the initial calibration. */
+ if (apic->lapic_timer.timer_advance_adjust_done)
+ pr_err_once("kvm: broken expectation in lapic timer_advance_ns");
+
ns = advance_expire_delta * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
- timer_advance_ns += min((u32)ns,
- timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+ timer_advance_ns += (u32)ns;
}

- if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
- apic->lapic_timer.timer_advance_adjust_done = true;
if (unlikely(timer_advance_ns > 5000)) {
timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT;
apic->lapic_timer.timer_advance_adjust_done = false;
}
+
apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+ apic->lapic_timer.timer_advance_adjust_done = true;
}

static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
@@ -1526,7 +1547,8 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);

- if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
+ if (unlikely(!apic->lapic_timer.timer_advance_adjust_done) ||
+ guest_tsc < tsc_deadline)
adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
}

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 50053d2..2e6e499 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -33,6 +33,7 @@ struct kvm_timer {
u64 expired_tscdeadline;
u32 timer_advance_ns;
s64 advance_expire_delta;
+ s64 moving_average;
atomic_t pending; /* accumulated triggered timers */
bool hv_timer_in_use;
bool timer_advance_adjust_done;