Re: [PATCH] Add busy loop polling for idle SMT

From: Peter Zijlstra
Date: Wed Nov 17 2021 - 05:58:59 EST


On Tue, Nov 16, 2021 at 07:51:35PM +0800, Peng Wang wrote:
> Now we have cpu_idle_force_poll, which uses cpu_relax() while waiting for
> an arriving IPI, but sometimes a busy loop on an idle cpu is also
> useful to provide consistent pipeline interference for hardware SMT.
>
> When hardware SMT is enabled, the switching between idle and
> busy state of one cpu will cause performance fluctuation of
> other sibling cpus on the same core.
>
> In a pay-for-execution-time scenario, cloud service providers prefer
> stable performance data so they can set a stable price for the same workload.
> Different execution times of the same workload, caused by the different
> idle or busy state of sibling SMT cpus, will produce different bills, which
> is confusing for customers.
>
> Since there is no dynamic CPU time scaling based on SMT pipeline interference,
> to coordinate sibling SMT noise no matter whether they are idle or not,
> busy loop in idle state can provide approximately consistent pipeline interference.
>
> For example, a workload computing tangent and cotangent will finish in 9071ms when
> sibling SMT cpus are idle, and 13299ms when sibling SMT cpus are computing other workloads.
> This generates a 32% performance fluctuation.
>
> SMT idle polling makes things slower, but we can set a bigger cpu quota to make up
> the deficiency. This also increases power consumption by 2.2%, which is acceptable.
>
> There may be some other possible solutions, but each has its own problem:
> a) disable hardware SMT, which means half of the SMT threads are unused and hardware cost is higher.
> b) busy loop in a userspace thread, but the cpu usage is confusing.
>
> We propose this patch to discuss the performance fluctuation problem related to SMT
> pipeline interference, and any comments are welcome.

I think you missed April Fools' Day by a wide margin.

Lowering performance and increasing power usage is a direct
contradiction to sanity. It also doesn't really work as advertised:
if the siblings are competing for AVX resources, the performance is a
*lot* lower than when an AVX task is competing against a spinner like
this.