Re: [PATCH 2/2] x86/idle: use dynamic halt poll

From: Wanpeng Li
Date: Tue Jul 04 2017 - 18:30:33 EST

2017-07-03 17:28 GMT+08:00 Yang Zhang <yang.zhang.wz@xxxxxxxxx>:
> On 2017/6/27 22:22, Radim KrÄmÃÅ wrote:
>> 2017-06-27 15:56+0200, Paolo Bonzini:
>>> On 27/06/2017 15:40, Radim KrÄmÃÅ wrote:
>>>>> ... which is not necessarily _wrong_. It's just a different heuristic.
>>>> Right, it's just harder to use than host's single_task_running() -- the
>>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>>> at other VCPUs that are not halted, but still preempted.
>>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>>> yield to the host. Working under the assumption that there is work for
>>>> this PCPU if other VCPUs have stuff to do. The downside is that it
>>>> misses information about host's topology, so it would be hard to make it
>>>> work well.
>>> I would just use vcpu_is_preempted on the current CPU. From guest POV
>>> this option is really a "f*** everyone else" setting just like
>>> idle=poll, only a little more polite.
>> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>>> If we've been preempted and we were polling, there are two cases. If an
>>> interrupt was queued while the guest was preempted, the poll will be
>>> treated as successful anyway.
>> I think the poll should be treated as invalid if the window has expired
>> while the VCPU was preempted -- the guest can't tell whether the
>> interrupt arrived still within the poll window (unless we added paravirt
>> for that), so it shouldn't be wasting time waiting for it.
>>> If it hasn't, let others run---but really
>>> that's not because the guest wants to be polite, it's to avoid that the
>>> scheduler penalizes it excessively.
>> This sounds like a VM entry just to do an immediate VM exit, so paravirt
>> seems better here as well ... (the guest telling the host about its
>> window -- which could also be used to rule it out as a target in the
>> pause loop random kick.)
>>> So until it's preempted, I think it's okay if the guest doesn't care
>>> about others. You wouldn't use this option anyway in overcommitted
>>> situations.
>>> (I'm still not very convinced about the idea).
>> Me neither. (The same mechanism is applicable to bare-metal, but was
>> never used there, so I would rather bring the guest behavior closer to
>> bare-metal.)
> The background is that we(Alibaba Cloud) do get more and more complaints
> from our customers in both KVM and Xen compare to bare-mental.After
> investigations, the root cause is known to us: big cost in message passing
> workload(David show it in KVM forum 2015)
> A typical message workload like below:
> vcpu 0 vcpu 1
> 1. send ipi 2. doing hlt
> 3. go into idle 4. receive ipi and wake up from hlt
> 5. write APIC time twice 6. write APIC time twice to
> to stop sched timer reprogram sched timer

I didn't find these two scenarios will program APIC timer twice
separately instead of once separately, could you point out the codes?

Wanpeng Li

> 7. doing hlt 8. handle task and send ipi to
> vcpu 0
> 9. same to 4. 10. same to 3
> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
> cost of such vmexits will degrades performance severely. Linux kernel
> already provide idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
> compromise would be to turn off NOHZ kernel, but it is not the default
> config for new distributions. Same for halt-poll in KVM, it only solve the
> cost from schedule in/out in host and can not help such workload much.
> The purpose of this patch we want to improve current idle=poll mechanism to
> use dynamic polling and do poll before touch sched timer. It should not be a
> virtualization specific feature but seems bare mental have low cost to
> access the MSR. So i want to only enable it in VM. Though the idea below the
> patch may not so perfect to fit all conditions, it looks no worse than now.
> How about we keep current implementation and i integrate the patch to
> para-virtualize part as Paolo suggested? We can continue discuss it and i
> will continue to refine it if anyone has better suggestions?
> --
> Yang
> Alibaba Cloud Computing