Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

From: David Matlack
Date: Thu Feb 05 2015 - 15:39:28 EST


On Thu, Feb 5, 2015 at 8:05 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is present, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.

Awesome. I have been working on the same feature in parallel so I have
some suggestions :)

>
> This parameter helps a lot for latency-bound workloads---in particular
> I tested it with O_DSYNC writes with a battery-backed disk in the host.
> In this case, writes are fast (because the data doesn't have to go all
> the way to the platters) but they cannot be merged by either the host or
> the guest. KVM's performance here is usually around 30% of bare metal,
> or 50% if you use cache=directsync or cache=writethrough (these
> parameters keep the guest from sending pointless flush requests, and
> at the same time they are not slow because of the battery-backed cache).
> The bad performance happens because on every halt the host CPU decides
> to halt itself too. When the interrupt comes, the vCPU thread is then
> migrated to a new physical CPU, and in general the latency is horrible
> because the vCPU thread has to be scheduled back in.
>
> With this patch performance reaches 60-65% of bare metal and, more
> important, 99% of what you get if you use idle=poll in the guest. This

I used loopback TCP_RR and loopback memcache as benchmarks for halt
polling, and saw results very similar to yours (before: 40% of bare
metal; after: 60-65% of bare metal and 95% of guest idle=poll).

> means that the tunable gets rid of this particular bottleneck, and more
> work can be done to improve performance in the kernel or QEMU.
>
> Of course there is some price to pay; every time an otherwise idle vCPU
> is interrupted, it will poll unnecessarily and thus
> impose a little load on the host. The above results were obtained with
> a mostly random value of the parameter (2000000), and the load was around
> 1.5-2.5% CPU usage on one of the host's cores for each idle guest vCPU.
>
> The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
> that can be used to tune the parameter. It counts how many HLT
> instructions received an interrupt during the polling period; each
> successful poll saves Linux from scheduling the VCPU thread out and back
> in, and may also avoid a likely trip to C1 and back for the physical CPU.
>
> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
> Of these halts, almost all are failed polls. During the benchmark,
> instead, basically all halts end within the polling period, except a more
> or less constant stream of 50 per second coming from vCPUs that are not
> running the benchmark. The wasted time is thus very low. Things may
> be slightly different for Windows VMs, which have a ~10 ms timer tick.
>
> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer. Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-10000 clock
> cycles compared to 20000-120000 without setting halt_poll. For the TSC
> deadline timer, thus, the effect is both a smaller average latency and
> a smaller variance.
>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---

Reviewed-by: David Matlack <dmatlack@xxxxxxxxxx>

> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++----
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 22 +++++++++++++++-------
> 4 files changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 848947ac6ade..a236e39cc385 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
> u32 irq_window_exits;
> u32 nmi_window_exits;
> u32 halt_exits;
> + u32 halt_successful_poll;
> u32 halt_wakeup;
> u32 request_irq_exits;
> u32 irq_exits;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1373e04e1f19..b7b20828f01c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
> static bool ignore_msrs = 0;
> module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
>
> +unsigned int halt_poll = 0;
> +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);

I suggest encoding the units in the name: "halt_poll_cycles" in this case.

> +
> unsigned int min_timer_period_us = 500;
> module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>
> @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "irq_window", VCPU_STAT(irq_window_exits) },
> { "nmi_window", VCPU_STAT(nmi_window_exits) },
> { "halt_exits", VCPU_STAT(halt_exits) },
> + { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "hypercalls", VCPU_STAT(hypercalls) },
> { "request_irq", VCPU_STAT(request_irq_exits) },
> @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
> int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> {
> ++vcpu->stat.halt_exits;
> - if (irqchip_in_kernel(vcpu->kvm)) {
> - vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> - return 1;
> - } else {
> + if (!irqchip_in_kernel(vcpu->kvm)) {
> vcpu->run->exit_reason = KVM_EXIT_HLT;
> return 0;
> }
> +
> + vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> + if (halt_poll) {

Would it be useful to poll in kvm_vcpu_block() for the benefit of all
architectures?
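
To make the question concrete, here is a rough user-space model (not
kernel code) of a poll phase placed in the common halt path ahead of
the blocking path. Everything named *_model and the max_polls budget is
made up for illustration; only the "negative return means stop
waiting" convention is taken from kvm_vcpu_check_block() in the patch.

```c
#include <assert.h>
#include <errno.h>

/* Toy vCPU: an "interrupt" becomes visible after a number of checks. */
struct vcpu_model {
	unsigned int checks_until_irq;     /* hypothetical: checks before the event shows up */
	unsigned int halt_successful_poll; /* wakeups caught during the poll phase */
	unsigned int blocked;              /* times we fell through to the blocking path */
};

/* Stand-in for kvm_vcpu_check_block(): negative means "stop waiting". */
static int check_block_model(struct vcpu_model *v)
{
	if (v->checks_until_irq == 0)
		return -EINTR;
	v->checks_until_irq--;
	return 0;
}

/*
 * Arch-neutral halt path: poll for up to max_polls checks, then block.
 * max_polls is only a placeholder for whatever budget (cycles or time)
 * ends up being used.
 */
static void vcpu_block_model(struct vcpu_model *v, unsigned int max_polls)
{
	assert(v);

	for (unsigned int i = 0; i < max_polls; i++) {
		if (check_block_model(v) < 0) {
			v->halt_successful_poll++;
			return;
		}
	}

	/* Slow path: real code would prepare_to_wait() and schedule() here. */
	v->blocked++;
}
```

Since the poll phase only needs the check helper, nothing in this
shape is x86 specific.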

> + u64 start, curr;
> + rdtscll(start);

Why cycles instead of time?
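
A small sketch of the difference between the two units: the same cycle
count maps to a different wall-clock budget depending on the host's TSC
frequency. The 2 GHz and 3 GHz frequencies are assumed values, and
cycles_to_ns() and poll_budget_left() are hypothetical helpers written
for this example, not existing kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a TSC cycle count to nanoseconds. tsc_khz is the host TSC
 * frequency in kHz. The multiplication overflows for very large cycle
 * counts (above roughly 1.8e13), which is fine for a poll budget.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
	assert(tsc_khz != 0);
	return cycles * 1000000ull / tsc_khz;
}

/* Time-based loop condition: true while the budget is not used up. */
static bool poll_budget_left(uint64_t start_ns, uint64_t now_ns,
			     uint64_t budget_ns)
{
	return now_ns - start_ns < budget_ns;
}
```

With the 2000000 value from the commit message this gives 1 ms on the
assumed 2 GHz host and about 0.67 ms on the assumed 3 GHz host, so a
parameter in nanoseconds would mean the same thing on both.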

> + do {
> + /*
> + * This sets KVM_REQ_UNHALT if an interrupt
> + * arrives.
> + */
> + if (kvm_vcpu_check_block(vcpu) < 0) {
> + ++vcpu->stat.halt_successful_poll;
> + break;
> + }
> + rdtscll(curr);
> + } while(!need_resched() && curr - start < halt_poll);

I found that need_resched() was not sufficient to prevent VCPUs from
delaying their own progress. To test this, try running loopback TCP_RR
in a 2 VCPU VM confined to 1 PCPU, with and without polling. The
problem goes away if you stop polling as soon as there are other
runnable threads on your cpu (e.g. use "single_task_running()" instead
of "!need_resched()", see
http://lxr.free-electrons.com/source/kernel/sched/core.c#L2398 ). This
also guarantees that polling only delays the idle thread.
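
The difference between the two stop conditions can be shown with a toy
user-space model. It assumes that the need_resched flag is only raised
some time after a second thread becomes runnable (resched_after below,
an invented knob), whereas a runnable-count check sees the other thread
immediately. struct rq_model and wasted_polls() are hypothetical names,
not scheduler APIs.

```c
#include <assert.h>
#include <stdbool.h>

struct rq_model {
	unsigned int nr_running;    /* runnable tasks on this CPU, poller included */
	unsigned int resched_after; /* poll iterations until a resched is flagged */
};

enum stop_policy { STOP_ON_NEED_RESCHED, STOP_ON_OTHER_RUNNABLE };

/* Count poll iterations spent before the chosen policy stops the loop. */
static unsigned int wasted_polls(const struct rq_model *rq,
				 enum stop_policy policy,
				 unsigned int budget)
{
	unsigned int spent = 0;

	assert(rq);

	while (spent < budget) {
		/* Flag is raised late, and only if someone else is waiting. */
		bool need_resched = rq->nr_running > 1 &&
				    spent >= rq->resched_after;
		/* Models the single_task_running() style check. */
		bool single_task = rq->nr_running == 1;

		if (policy == STOP_ON_NEED_RESCHED && need_resched)
			break;
		if (policy == STOP_ON_OTHER_RUNNABLE && !single_task)
			break;
		spent++;
	}

	return spent;
}
```

In this model a poller sharing the CPU with one other runnable thread
spins for resched_after iterations under the first policy and for zero
under the second. On an otherwise idle CPU both policies use the whole
budget.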

> + }
> +
> + return 1;
> }
> EXPORT_SYMBOL_GPL(kvm_emulate_halt);
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8a82838034f1..1519d48d956f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -584,6 +584,7 @@ int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
> unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
> void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>
> +int kvm_vcpu_check_block(struct kvm_vcpu *vcpu);
> void kvm_vcpu_block(struct kvm_vcpu *vcpu);
> void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
> int kvm_vcpu_yield_to(struct kvm_vcpu *target);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0c281760a1c5..825fc3ec0509 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1813,6 +1813,20 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_GPL(mark_page_dirty);
>
> +int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> +{
> + if (kvm_arch_vcpu_runnable(vcpu)) {
> + kvm_make_request(KVM_REQ_UNHALT, vcpu);
> + return -EINTR;
> + }
> + if (kvm_cpu_has_pending_timer(vcpu))
> + return -EINTR;
> + if (signal_pending(current))
> + return -EINTR;
> +
> + return 0;
> +}
> +
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> @@ -1823,13 +1837,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> for (;;) {
> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>
> - if (kvm_arch_vcpu_runnable(vcpu)) {
> - kvm_make_request(KVM_REQ_UNHALT, vcpu);
> - break;
> - }
> - if (kvm_cpu_has_pending_timer(vcpu))
> - break;
> - if (signal_pending(current))
> + if (kvm_vcpu_check_block(vcpu) < 0)
> break;
>
> schedule();
> --
> 1.8.3.1