Re: [PATCH 2/2] x86/idle: use dynamic halt poll
From: Radim Krčmář
Date: Tue Jul 04 2017 - 10:13:48 EST
2017-07-03 17:28+0800, Yang Zhang:
> The background is that we (Alibaba Cloud) get more and more complaints
> from our customers on both KVM and Xen compared to bare metal. After
> investigation, the root cause is known to us: the high cost of message-passing
> workloads (David showed it at KVM Forum 2015).
> A typical message-passing workload looks like this:
> vcpu 0                        vcpu 1
> 1. send ipi                   2. doing hlt
> 3. go into idle               4. receive ipi and wake up from hlt
> 5. write APIC time twice      6. write APIC time twice to
>    to stop sched timer           reprogram sched timer
One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?
> 7. doing hlt                  8. handle task and send ipi to
>                                  vcpu 0
> 9. same to 4.                 10. same to 3
> One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes). The
> cost of these vmexits degrades performance severely.
Yeah, sounds like too much ... I understood that there are
  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1
which adds to 6 MSR writes -- what are the other 4?
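
To make the accounting explicit, here is a minimal sketch of the tally in C. The numbers are the ones quoted in this thread; the struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Per-transaction counts as stated in this thread. All names here are
 * hypothetical; nothing below is taken from the kernel or the patch. */
struct exit_tally {
    int hlt_exits;    /* HLT vmexits claimed per transaction */
    int msr_exits;    /* MSR-write vmexits claimed per transaction */
    int ipi_writes;   /* one IPI in each direction */
    int timer_writes; /* APIC timer writes in the step list */
};

/* MSR writes that the step list accounts for. */
static int explained_msr_writes(const struct exit_tally *t)
{
    return t->ipi_writes + t->timer_writes;
}

/* MSR writes that are claimed but not explained by the step list. */
static int unexplained_msr_writes(const struct exit_tally *t)
{
    assert(t->msr_exits >= explained_msr_writes(t));
    return t->msr_exits - explained_msr_writes(t);
}
```

With `{ .hlt_exits = 2, .msr_exits = 10, .ipi_writes = 2, .timer_writes = 4 }`, the explained count is 6 and the unexplained count is 4, which is the gap asked about above.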
> Linux kernel
> already provides idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexits. It does nothing about starting/stopping the sched timer. A
> compromise would be to turn off NOHZ in the kernel, but that is not the default
> config for new distributions. Same for halt-poll in KVM: it only addresses the
> cost of scheduling in/out on the host and cannot help such a workload much.
> The purpose of this patch is to improve the current idle=poll mechanism to
Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread. MWAIT solves the IPI problem, but doesn't
get rid of the timer one.
> use dynamic polling and to poll before touching the sched timer. It should not be a
> virtualization-specific feature, but bare metal seems to have a low cost to
> access the MSR, so I want to enable it only in VMs. Though the idea behind the
> patch may not fit all conditions perfectly, it looks no worse than now.
It adds code to hot-paths (interrupt handlers) while trying to optimize
an idle-path, which is suspicious.
> How about we keep the current implementation and I integrate the patch into the
> paravirtualized part, as Paolo suggested? We can continue to discuss it, and I
> will continue to refine it if anyone has better suggestions.
I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.
This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 and 1ms.
Do you see disadvantages of this combined with MWAIT?
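
A toy user-space model of the keep-armed idea, to show where the saving would come from. This is a sketch under simplifying assumptions, not kernel code: all names are hypothetical, every idle entry is assumed to find the timer with the same time left until expiry, and the baseline is assumed to need one write to stop the timer and one to restart it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policies for this illustration only. */
enum policy {
    STOP_ON_IDLE, /* stop the timer on idle entry, reprogram it on exit */
    KEEP_ARMED    /* leave it armed; skip the re-arm if it expires while idle */
};

struct cost {
    unsigned timer_writes; /* timer programming operations */
    unsigned timer_irqs;   /* timer interrupts taken while idle */
};

/* Cost of n idle periods of idle_us[i] microseconds each, assuming the
 * timer has remaining_us left until expiry at every idle entry. */
static struct cost idle_cost(enum policy p, const unsigned *idle_us,
                             size_t n, unsigned remaining_us)
{
    struct cost c = { 0, 0 };

    assert(idle_us != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (p == STOP_ON_IDLE) {
            c.timer_writes += 2;  /* one to stop, one to restart */
        } else if (idle_us[i] >= remaining_us) {
            c.timer_irqs += 1;    /* timer fired while idle, not re-armed */
            c.timer_writes += 1;  /* re-arm once on idle exit */
        }
        /* KEEP_ARMED with a short idle period: timer untouched. */
    }
    return c;
}

/* Treats an interrupt as costing about the same as a timer write. */
static unsigned total_exits(struct cost c)
{
    return c.timer_writes + c.timer_irqs;
}
```

For idle periods of 100, 500, and 2000 us with 1000 us left on the timer, the model gives 6 timer writes for `STOP_ON_IDLE` and 1 write plus 1 interrupt for `KEEP_ARMED`. Under these assumptions the long idle period costs the same either way, and only the short ones get cheaper.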