RE: NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)

From: Doug Smythies
Date: Wed Jul 03 2019 - 10:06:44 EST


On 2019.07.01 08:34 Alan Jenkins wrote:

> Hi

Hi,

> I tried running a simple test:
>
> dd if=testfile iflag=direct bs=1M of=/dev/null
>
> With my default settings, `vmstat 10` shows something like 85% idle time
> to 15% iowait time. I have 4 CPUs, so this is much less than one CPU
> worth of iowait time.
>
> If I boot with "nohz=off", I see idle time fall to 75% or below, and
> iowait rise to about 25%, equivalent to one CPU. That is what I had
> originally expected.
>
> (I can also see my expected numbers, if I disable *all* C-states and
> force polling using `pm_qos_resume_latency_us` in sysfs).
>
> The numbers above are from a kernel somewhere around v5.2-rc5. I saw
> the "wrong" results on some previous kernels as well. I just now
> realized the link to NO_HZ_IDLE.[1]
>
> [1]
> https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836
>
> I did not find any information about this high level of inaccuracy. Can
> anyone explain, is this behaviour expected?

I'm not commenting on whether this behaviour is expected, just that it is
inconsistent.

>
> I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if
> they described this problem, it was not clear to me.
>
> I thought this might also be affecting the "IO pressure" values from the
> new "pressure stall information"... but I am too confused already, so I
> am only asking about iowait at the moment :-).

Using your workload, I can confirm that /proc/stat (which vmstat uses)
behaves inconsistently across kernels 4.15, 4.16, and 4.17:
  4.15 does what you expect, whether idle states are enabled or disabled.
  4.16 doesn't do what you expect in either case (although it is a little erratic).
  >= 4.17 does what you expect with only idle state 0 enabled, and doesn't otherwise.

Actual test data from vmstat (/proc/stat), 8 CPUs (12.5% = 1 CPU):

Kernel  idle/iowait %  Idle states >= 1
4.15    88/12          enabled
4.15    88/12          disabled
4.16    99/1           enabled
4.16    99/1           disabled
4.17    98/2           enabled
4.17    88/12          disabled
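
For reference, percentages like the ones in the table can be derived from two readings of the aggregate "cpu" line in /proc/stat. The following is only a minimal sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice); the function names are made up for illustration and are not taken from vmstat.

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def idle_iowait_percent(before, after):
    """Percent of elapsed CPU time spent in idle and iowait between two snapshots."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    delta = [y - x for x, y in zip(a, b)]
    # Only the first 8 fields are summed: guest and guest_nice are
    # already accounted for in user and nice.
    total = sum(delta[:8])
    if total <= 0:
        raise ValueError("no time elapsed between snapshots")
    return 100.0 * delta[3] / total, 100.0 * delta[4] / total


def one_cpu_share(ncpus):
    """Percent of total CPU time that corresponds to one fully busy CPU."""
    return 100.0 / ncpus
```

To use it, read /proc/stat, wait (for example 10 seconds, as with `vmstat 10`), read it again, and pass both texts to idle_iowait_percent(). With 8 CPUs, one CPU's worth of iowait is one_cpu_share(8), i.e. 12.5%.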

Note 1: I never booted with "nohz=off". The tick never turns off in
idle state 0, so enabling only idle state 0 is good enough for testing.
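
One possible way to get the "Idle states >= 1 disabled" configuration is the per-state `disable` attribute under cpuidle in sysfs. This is a sketch under that assumption, not necessarily the method used for the numbers above; the function name and the `sysfs_root` parameter are hypothetical, and writing to the real files needs root.

```python
import glob
import os
import re


def set_deep_idle_states_disabled(disabled, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 or 0 to the 'disable' attribute of every idle state >= 1.

    Returns the list of files written. Idle state 0 is left untouched.
    """
    written = []
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*", "disable"
    )
    for path in sorted(glob.glob(pattern)):
        state_dir = os.path.basename(os.path.dirname(path))
        match = re.fullmatch(r"state(\d+)", state_dir)
        if match is None or int(match.group(1)) == 0:
            continue  # skip unexpected names and leave state 0 enabled
        with open(path, "w") as f:
            f.write("1" if disabled else "0")
        written.append(path)
    return written
```

Calling set_deep_idle_states_disabled(True) before a test run and set_deep_idle_states_disabled(False) afterwards would switch between the two configurations without a reboot.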

Note 2: I don't use /proc/stat myself for idle time statistics. I use:
/sys/devices/system/cpu/cpu*/cpuidle/state*/time
Those always seem to be consistent with the higher idle percentage number.
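
As an illustration of Note 2, the cpuidle `time` files (cumulative residency in microseconds) can be summed per state and turned into an idle percentage over an interval. This is a rough sketch only; the function names and the `sysfs_root` parameter are assumptions made for the example.

```python
import glob
import os


def cpuidle_time_us(sysfs_root="/sys/devices/system/cpu"):
    """Sum cpuidle residency (microseconds) per state name across all CPUs."""
    totals = {}
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*", "time"
    )
    for path in glob.glob(pattern):
        state = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            totals[state] = totals.get(state, 0) + int(f.read().strip())
    return totals


def idle_percent(before, after, elapsed_s, ncpus):
    """Idle percentage over an interval, from two cpuidle_time_us() results."""
    delta_us = sum(after.values()) - sum(before.values())
    return 100.0 * delta_us / (elapsed_s * 1_000_000 * ncpus)
```

Taking one cpuidle_time_us() reading before and one after a fixed interval, then calling idle_percent() with the interval length and CPU count, gives a figure that can be compared with the idle column from vmstat.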

Unless someone has some insight, the next step is kernel bisection:
once between kernels 4.15 and 4.16, then again between 4.16 and 4.17.
The second bisection might go faster with knowledge gained from the first.
Alan: Can you do the kernel bisection? I can only start on it maybe Friday.

... Doug