Re: [PATCH 2/2] x86/idle: use dynamic halt poll

From: Yang Zhang
Date: Mon Jul 03 2017 - 05:29:18 EST

On 2017/6/27 22:22, Radim KrÄmÃÅ wrote:
2017-06-27 15:56+0200, Paolo Bonzini:
On 27/06/2017 15:40, Radim KrÄmÃÅ wrote:
... which is not necessarily _wrong_. It's just a different heuristic.
Right, it's just harder to use than host's single_task_running() -- the
VCPU calling vcpu_is_preempted() is never preempted, so we have to look
at other VCPUs that are not halted, but still preempted.

If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
yield to the host. Working under the assumption that there is work for
this PCPU if other VCPUs have stuff to do. The downside is that it
misses information about host's topology, so it would be hard to make it
work well.

I would just use vcpu_is_preempted on the current CPU. From guest POV
this option is really a "f*** everyone else" setting just like
idle=poll, only a little more polite.

vcpu_is_preempted() on current cpu cannot return true, AFAIK.

If we've been preempted and we were polling, there are two cases. If an
interrupt was queued while the guest was preempted, the poll will be
treated as successful anyway.

I think the poll should be treated as invalid if the window has expired
while the VCPU was preempted -- the guest can't tell whether the
interrupt arrived still within the poll window (unless we added paravirt
for that), so it shouldn't be wasting time waiting for it.

If it hasn't, let others run---but really
that's not because the guest wants to be polite, it's to avoid that the
scheduler penalizes it excessively.

This sounds like a VM entry just to do an immediate VM exit, so paravirt
seems better here as well ... (the guest telling the host about its
window -- which could also be used to rule it out as a target in the
pause loop random kick.)

So until it's preempted, I think it's okay if the guest doesn't care
about others. You wouldn't use this option anyway in overcommitted

(I'm still not very convinced about the idea).

Me neither. (The same mechanism is applicable to bare-metal, but was
never used there, so I would rather bring the guest behavior closer to

The background is that we(Alibaba Cloud) do get more and more complaints from our customers in both KVM and Xen compare to bare-mental.After investigations, the root cause is known to us: big cost in message passing workload(David show it in KVM forum 2015)

A typical message workload like below:
vcpu 0 vcpu 1
1. send ipi 2. doing hlt
3. go into idle 4. receive ipi and wake up from hlt
5. write APIC time twice 6. write APIC time twice to
to stop sched timer reprogram sched timer
7. doing hlt 8. handle task and send ipi to
vcpu 0
9. same to 4. 10. same to 3

One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The cost of such vmexits will degrades performance severely. Linux kernel already provide idle=poll to mitigate the trend. But it only eliminates the IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A compromise would be to turn off NOHZ kernel, but it is not the default config for new distributions. Same for halt-poll in KVM, it only solve the cost from schedule in/out in host and can not help such workload much.

The purpose of this patch we want to improve current idle=poll mechanism to use dynamic polling and do poll before touch sched timer. It should not be a virtualization specific feature but seems bare mental have low cost to access the MSR. So i want to only enable it in VM. Though the idea below the patch may not so perfect to fit all conditions, it looks no worse than now.
How about we keep current implementation and i integrate the patch to para-virtualize part as Paolo suggested? We can continue discuss it and i will continue to refine it if anyone has better suggestions?

Alibaba Cloud Computing