Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

From: Li, Aubrey
Date: Tue Jul 11 2017 - 00:40:33 EST

On 2017/7/11 1:27, Andi Kleen wrote:
> On Mon, Jul 10, 2017 at 06:42:06PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 10, 2017 at 07:46:09AM -0700, Andi Kleen wrote:
>>>> So how much of the gain is simply due to skipping NOHZ? Mike used to
>>>> carry a patch that would throttle NOHZ. And that is a _far_ smaller and
>>>> simpler patch to do.
>>> Have you ever looked at a ftrace or PT trace of the idle entry?
>>> There's just too much stuff going on there. NOHZ is just the tip
>>> of the iceberg.
>> I have, and last time I did the actual poking at the LAPIC (to make NOHZ
>> happen) was by far the slowest thing happening.
> That must have been a long time ago because modern systems use TSC deadline
> for a very long time ...
> It's still slow, but not as slow as the LAPIC.
>> Data to indicate what hurts how much would be a very good addition to
>> the Changelogs. Clearly you have some, you really should have shared.
Here is some data that indicates why we need this improvement:

Given that we now have new low-latency I/O devices such as 3D XPoint memory and
25/40Gb Ethernet, this proposal also targets improving the latency of
microsecond (us)-scale events.

Basically we are looking at how much we can improve (instead of what hurts);
the data below is measured against v4.8.8.

In the idle loop:

- quiet_vmstat costs 5562ns - 6296ns
- tick_nohz_idle_enter costs 7058ns - 10726ns
- in total, the path from arch_cpu_idle_enter entry to arch_cpu_idle_exit
  return costs 9122ns - 15318ns
  -- within this period, rcu_idle_enter costs 1985ns - 2262ns and
     rcu_idle_exit costs 1813ns - 3507ns
- tick_nohz_idle_exit costs 8372ns - 20850ns

Benchmark fio on an NVMe disk shows a 3-4% improvement from skipping nohz,
with the full fast idle path adding an extra 1-2% on top.

Benchmark netperf loopback in TCP Request-Response mode shows a 6-7%
improvement from skipping nohz, with the full fast idle path adding an extra
2-3% on top.
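For reproducibility, a queue-depth-1 4K random-read fio job of the kind
typically used for this sort of idle-latency test might look like the
following (the job name, device path, and runtime are my assumptions, not
taken from the original report; the netperf side would be run as
`netperf -H 127.0.0.1 -t TCP_RR`):

```ini
; hypothetical fio job file - parameters are illustrative
[randread-qd1]
ioengine=libaio
rw=randread
bs=4k
iodepth=1
direct=1
filename=/dev/nvme0n1
runtime=60
time_based
```

QD1 with direct=1 keeps exactly one request in flight, so each completion
drops the CPU into idle and the idle entry/exit cost shows up directly in
the per-I/O latency.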

Note that the data includes measurement overhead and may vary across
platforms, CPU frequencies, and workloads, but the results are consistent
once the testing configuration is fixed.