Re: [PATCH v4] KVM: halt-polling: poll for the upcoming fire timers

From: Yang Zhang
Date: Tue May 24 2016 - 22:10:24 EST

On 2016/5/25 7:37, David Matlack wrote:
On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@xxxxxxxxxx>:
On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
From: Wanpeng Li <>

If an emulated lapic timer will fire soon (within 10us, which is the
base of dynamic halt-polling; at the lower end of message-passing
workload latency, TCP_RR's poll time is < 10us), we can treat it as a
short halt and poll until it fires. The fire callback apic_timer_fn()
will set KVM_REQ_PENDING_TIMER, and this flag is checked during the
busy poll. This avoids the context switch overhead and the latency of
waking up the vCPU.
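
For illustration only, the polling idea described above could be sketched
in userspace C roughly as follows. This is not the patch itself: struct
vcpu_sim, sim_timer_fn() and halt_poll_for_timer() are hypothetical names,
time is a plain counter, and the timer_pending field merely stands in for
KVM_REQ_PENDING_TIMER.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated vCPU state; all names here are hypothetical. */
struct vcpu_sim {
    bool timer_pending;     /* stands in for KVM_REQ_PENDING_TIMER */
    uint64_t timer_expiry;  /* simulated time at which the timer fires */
};

/* Stand-in for the timer callback: mark the request once the timer is due. */
static void sim_timer_fn(struct vcpu_sim *v, uint64_t now)
{
    if (now >= v->timer_expiry)
        v->timer_pending = true;
}

/*
 * Busy-poll from `start` for at most `window` time units.
 * Returns true if the pending flag was seen during the poll (no
 * schedule-out needed), false if the window ran out first.
 */
static bool halt_poll_for_timer(struct vcpu_sim *v, uint64_t start,
                                uint64_t window)
{
    for (uint64_t now = start; now <= start + window; now++) {
        sim_timer_fn(v, now);
        if (v->timer_pending)
            return true;
    }
    return false;
}
```

With a window of 10 units, a timer due at 5 is caught inside the poll,
while one due at 50 is not and the caller would fall back to blocking.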

This feature is slightly different from the current advance expiration
approach. Advance expiration relies on the vCPU running (it polls
before vmentry). But in some cases the timer interrupt may be blocked
by another thread (i.e., the IF bit is clear) and the vCPU cannot be
scheduled to run immediately. So even if the timer is advanced, the
vCPU may still see the latency. Polling is different: it ensures the
vCPU is aware of the timer expiration before it is scheduled out.

echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.

Context switching - times in microseconds - smaller is better
Host      OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel    Linux 4.6.0+  7.9800 11.0   10.8   14.6   9.4300 13.0    10.2    vanilla
kernel    Linux 4.6.0+  15.3   13.6   10.7   12.5   9.0000 12.8    7.38000 poll

These results aren't very compelling. Sometimes polling is faster,
sometimes vanilla is faster, sometimes they are about the same.

Runs with more processes and bigger cache footprints benefit, since I
enabled the hrtimer (HRTICK) for precise preemption.

The VCPU is halted (idle), so the timer interrupt is not preempting
anything. Also I would not expect any preemption in a context
switching benchmark, the threads should be handing off execution to
one another.

I'm confused why timers would play any role in the performance of this
benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K

I tried to emulate Yang's workload,
and his real workload can get more benefit, as he mentioned.

I imagine there are hyper sensitive workloads which cannot tolerate a
long tail in timer latency (e.g. realtime workloads). I would expect a
patch like this to provide a "smoothing effect", reducing that tail.
But for cloud/server workloads, I would not expect any sensitivity to
jitter in timer latency (especially while the VCPU is halted).

Yang's is a real cloud workload.

I have 2 issues with optimizing for Yang's workload. Yang, please
correct me if I am mis-characterizing it.
1. The delay in timer interrupts is caused by something disabling the
interrupts on the CPU for more than a millisecond. It seems that is
the real issue. I'm wary of using polling as a workaround.

Yes, this is the most likely case.

2. The delay is caused by a separate task. Halt-polling would not help
in that scenario, it would yield the CPU to that task.

In some cases, the separate task is migrated from another CPU after this CPU enters the idle state, so halt-polling may still help. The delay is then caused by two context switches (scheduling the VCPU out and migrating the VCPU to another idle CPU).

Note that while halt-polling happens when the CPU is idle, it's still
not free. It constricts the scheduler's cpu load balancer, because the
CPU appears to be busy. In KVM's default configuration, I'd prefer to
only add more polling when the gain is clear. If there are guest
workloads that want this patch, I'd suggest polling for timers be
default-off. At minimum, there should be a module parameter to control
it (like Christian Borntraeger suggested).

Yeah, I will add a module parameter to enable/disable it.
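
As a rough sketch of what a default-off switch could look like (not code
from the patch): in the kernel such a knob would typically be exposed with
module_param(), while the plain variable below is only a userspace
stand-in. The names halt_poll_timer, HALT_POLL_TIMER_WINDOW_NS and
should_poll_for_timer() are hypothetical, and the 10us window is taken
from the changelog above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical switch; default-off, as suggested in the review. */
static bool halt_poll_timer = false;

/* Hypothetical window: the 10us from the changelog, in nanoseconds. */
#define HALT_POLL_TIMER_WINDOW_NS 10000ULL

/*
 * Decide whether a halting vCPU should poll for its timer.
 * Polling is considered only when the switch is on and the timer
 * is already due or expires within the window.
 */
static bool should_poll_for_timer(uint64_t now_ns, uint64_t expiry_ns)
{
    if (!halt_poll_timer)
        return false;
    if (expiry_ns <= now_ns)
        return true; /* already due */
    return expiry_ns - now_ns <= HALT_POLL_TIMER_WINDOW_NS;
}
```

Keeping the check behind a single flag means the default behaviour is
unchanged unless an administrator opts in.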

Wanpeng Li

best regards