Re: [PATCH 12/14] KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers

From: Paolo Bonzini
Date: Wed Oct 16 2019 - 13:01:29 EST


On 16/10/19 18:50, Andrea Arcangeli wrote:
>> It still doesn't add up. 0.3ms / 5 is 1/15000th of a second; 43us is
>> 1/25000th of a second. Do you have multiple vCPUs perhaps?
>
> Why would I run any test on UP guests? Rather than spending time doing
> the math on my results, it's probably quicker if you run it yourself:

I don't know, but if you don't say how many vCPUs you have, I cannot do
the math and review the patch.

>> The number of vmexits doesn't count (for HLT). What counts is how long
>> they take to be serviced, and as long as it's 1us or more the
>> optimization is pointless.
>
> Please note the single_task_running() check which immediately breaks
> the kvm_vcpu_check_block() loop if there's even a single other task
> that can be scheduled in the runqueue of the host CPU.
>
> What happens when the host is not idle is shown below:
>
>                w/o optimization          with optimization
>                ----------------------    ----------------------------
> 0us            vmexit                    vmexit
> 500ns          retpoline                 call vmexit handler directly
> 600ns          retpoline                 kvm_vcpu_check_block()
> 700ns          retpoline                 schedule()
> 800ns          kvm_vcpu_check_block()
> 900ns          schedule()
> ...
>
> Disclaimer: the numbers on the left are arbitrary and I just cut and
> pasted them from yours, no idea how far off they are.

Yes, of course. But the idea is the same: yes, because of the retpoline
you run the guest for perhaps 300ns more before schedule()ing, but does
that really matter? 300ns * 20000 times/second is a 0.6% performance
impact, and 300ns is already very generous. I am not sure it would be
measurable at all.
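
For reference, the back-of-the-envelope arithmetic can be written out as a
small sketch. The helper name overhead_fraction is made up for illustration;
the 300ns and 20000 exits/second inputs are the figures used above and are
estimates, not measurements.

```python
def overhead_fraction(extra_ns_per_exit: float, exits_per_second: float) -> float:
    """Fraction of each second spent on the extra per-exit latency."""
    # Convert nanoseconds to seconds, then scale by the exit rate.
    return extra_ns_per_exit * 1e-9 * exits_per_second


# Assumed inputs: 300ns of extra latency per HLT exit, 20000 exits per second.
frac = overhead_fraction(300, 20000)
print(f"{frac:.1%}")
```

With a smaller, arguably more realistic per-exit cost, the fraction shrinks
proportionally.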

Paolo

> To be clear, I would find it very reasonable to be asked to prove
> the benefit of the HLT optimization with benchmarks specific to that
> single one-liner, but until then, the idea that we can drop the
> retpoline optimization from the HLT vmexit by just thinking about it
> still doesn't make sense to me, because by thinking about it I come to
> the opposite conclusion.
>
> The lack of single_task_running() in the guest driver is also why the
> guest cpuidle haltpoll risks wasting some CPU with host overcommit or
> with the host loaded at full capacity, and why we may not assume it to
> be universally enabled.
>
> Thanks,
> Andrea
>