Re: [PATCH v4] KVM: halt-polling: poll for the upcoming fire timers

From: David Matlack
Date: Tue May 24 2016 - 19:38:14 EST

On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@xxxxxxxxxx>:
>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
>>> From: Wanpeng Li <>
>>> If an emulated lapic timer will fire soon (within 10us, the base of
>>> dynamic halt-polling and the lower end of message-passing workload
>>> latency; TCP_RR's poll time is < 10us), we can treat it as a short halt
>>> and poll until it fires. The fire callback apic_timer_fn() will set
>>> KVM_REQ_PENDING_TIMER, and this flag will be checked during the busy
>>> poll. This avoids the context switch overhead and the latency of waking
>>> up the vCPU.
>>> This feature is slightly different from the current advance expiration
>>> approach. Advance expiration relies on the vCPU running (it polls
>>> before vmentry). But in some cases the timer interrupt may be blocked
>>> by another thread (i.e., the IF bit is clear) and the vCPU cannot be
>>> scheduled to run immediately. So even if the timer is advanced, the
>>> vCPU may still see the latency. Polling is different: it ensures the
>>> vCPU is aware of the timer expiration before it is scheduled out.
>>> echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.
>>> Context switching - times in microseconds - smaller is better
>>> -------------------------------------------------------------------------
>>> Host      OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>>>                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
>>> --------- ------------- ------ ------ ------ ------ ------ ------- -------
>>> kernel    Linux 4.6.0+  7.9800 11.0   10.8   14.6   9.4300 13.0    10.2    vanilla
>>> kernel    Linux 4.6.0+  15.3   13.6   10.7   12.5   9.0000 12.8    7.38000 poll
>> These results aren't very compelling. Sometimes polling is faster,
>> sometimes vanilla is faster, sometimes they are about the same.
> Runs with more processes and bigger cache footprints benefit from
> this, since I enabled the hrtimer tick (HRTICK) for precise preemption.

The VCPU is halted (idle), so the timer interrupt is not preempting
anything. I also would not expect any preemption in a context
switching benchmark; the threads should be handing off execution to
one another.

I'm confused why timers would play any role in the performance of this
benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K
cases?

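To put rough numbers on the table quoted above, here is a small sketch
that computes the relative change of "poll" against "vanilla" for each
column. The values are copied from the quoted results; the helper names
and the 5% cutoff for calling a difference meaningful are assumptions
made for illustration, not part of the benchmark.

```c
#include <stddef.h>

/* Column labels and results (microseconds) copied from the quoted table. */
static const char *const ctxsw_cols[] = {
    "2p/0K", "2p/16K", "2p/64K", "8p/16K", "8p/64K", "16p/16K", "16p/64K"
};
static const double vanilla_us[] = { 7.98, 11.0, 10.8, 14.6, 9.43, 13.0, 10.2 };
static const double poll_us[]    = { 15.3, 13.6, 10.7, 12.5, 9.00, 12.8, 7.38 };

#define NCOLS (sizeof(vanilla_us) / sizeof(vanilla_us[0]))

/* Relative change of poll versus vanilla in percent; negative means poll is faster. */
static double pct_change(double vanilla, double poll)
{
    return (poll - vanilla) / vanilla * 100.0;
}

/* Count columns where poll is faster than vanilla by more than threshold_pct. */
static size_t count_poll_faster(double threshold_pct)
{
    size_t n = 0;
    for (size_t i = 0; i < NCOLS; i++)
        if (pct_change(vanilla_us[i], poll_us[i]) < -threshold_pct)
            n++;
    return n;
}

/* Count columns where poll is slower than vanilla by more than threshold_pct. */
static size_t count_poll_slower(double threshold_pct)
{
    size_t n = 0;
    for (size_t i = 0; i < NCOLS; i++)
        if (pct_change(vanilla_us[i], poll_us[i]) > threshold_pct)
            n++;
    return n;
}
```

With the assumed 5% cutoff, poll is faster in two columns (8p/16K and
16p/64K), slower in two (2p/0K and 2p/16K), and within the cutoff in
the remaining three.
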
> Actually
> I try to emulate Yang's workload,
> And his real workload can get more benefit as he mentioned,
>> I imagine there are hyper sensitive workloads which cannot tolerate a
>> long tail in timer latency (e.g. realtime workloads). I would expect a
>> patch like this to provide a "smoothing effect", reducing that tail.
>> But for cloud/server workloads, I would not expect any sensitivity to
>> jitter in timer latency (especially while the VCPU is halted).
> Yang's is a real cloud workload.

I have 2 issues with optimizing for Yang's workload. Yang, please
correct me if I am mis-characterizing it.
1. The delay in timer interrupts is caused by something disabling the
interrupts on the CPU for more than a millisecond. It seems that is
the real issue. I'm wary of using polling as a workaround.
2. The delay is caused by a separate task. Halt-polling would not help
in that scenario; it would yield the CPU to that task.

>> Note that while halt-polling happens when the CPU is idle, it's still
>> not free. It constricts the scheduler's cpu load balancer, because the
>> CPU appears to be busy. In KVM's default configuration, I'd prefer to
>> only add more polling when the gain is clear. If there are guest
>> workloads that want this patch, I'd suggest polling for timers be
>> default-off. At minimum, there should be a module parameter to control
>> it (like Christian Borntraeger suggested).
> Yeah, I will add a module parameter to enable/disable it.
> Regards,
> Wanpeng Li
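
For illustration, the default-off gating discussed above could look
roughly like the userspace sketch below. This is not the patch's code:
poll_for_timer_enabled stands in for a module parameter, and
timer_poll_window_ns for the 10us window from the patch description;
both names are hypothetical, as is should_poll_for_timer().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical knob standing in for a KVM module parameter; off by default. */
static bool poll_for_timer_enabled = false;

/* Hypothetical window: 10us, the base of dynamic halt-polling per the patch text. */
static const uint64_t timer_poll_window_ns = 10000;

/*
 * Decide whether a halting vCPU should busy-poll for its timer rather
 * than be scheduled out. Polls only when the knob is on and the timer
 * has already expired or expires within the window.
 */
static bool should_poll_for_timer(uint64_t now_ns, uint64_t expires_ns)
{
    if (!poll_for_timer_enabled)
        return false;               /* default: no extra polling */
    if (expires_ns <= now_ns)
        return true;                /* timer is already due */
    return expires_ns - now_ns <= timer_poll_window_ns;
}
```

With the knob left at its default, the function always returns false,
so the scheduler's load balancer sees no extra busy time unless an
administrator opts in.
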