Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
From: Alan Jenkins
Date: Fri Jul 05 2019 - 07:25:53 EST
Hi, scheduler experts!
My cpu "iowait" time appears to be reported incorrectly. Do you know
why this could happen?
Doug helped me - it was he who noticed different behaviour on v4.15 vs
v4.16 vs v4.17+. So I have some confirmation of this. I don't think
Doug mentioned what hardware he ran the kernels on. lscpu says my
hardware is "Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz".
I tested using "dd" :-
dd if=bigfile bs=1M iflag=direct of=/dev/null
Â(1) E.g. I get the expected result if I pin "dd" to the "right" cpu.Â
Which cpu varies; it has often been cpu2. At the moment I have booted
5.2-rc5-ish. Here I run "dd" with "taskset -c 0", and "iowait" worked
as expected:
top - 11:01:47 up 15:10, 2 users, load average: 1.07, 0.86, 0.86
Tasks: 288 total, 1 running, 287 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.3 us, 3.3 sy, 0.0 ni, 0.0 id, 94.3 wa, 0.7 hi, 1.3 si, 0.0 st
%Cpu1 : 0.3 us, 0.7 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu2 : 1.0 us, 0.7 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu3 : 0.7 us, 0.3 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7854.0 total, 432.2 free, 4616.4 used, 2805.4 buff/cache
MiB Swap: 2048.0 total, 1978.2 free, 69.8 used. 2498.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31849 alan-sy+ 20 0 216052 2836 1800 D 3.0 0.0 0:00.58 dd
24220 alan-sy+ 20 0 3339828 232160 126720 S 0.7 2.9 1:53.14 gnome-shell
...
(I have also used "atop" and "vmstat 3". "atop" shows both total and
per-cpu iowait, idle, etc. "vmstat 3" just shows a total, but all the
old values stay on-screen).
Â(2) But compare running "dd" with "taskset -c 1":
%Cpu0 : 0.3 us, 0.3 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.7 hi, 1.3 si, 0.0 st
%Cpu1 : 0.3 us, 3.0 sy, 0.0 ni, 83.7 id, 12.6 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu2 : 1.0 us, 0.3 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 1.3 us, 0.7 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Â(3) If I don't use "taskset", "dd" generally doesn't sit on the
"right" cpu, and so I get don't see the right "iowait". Here's "top -d
30". Over this longer interval, "dd" appears to spend a quarter of its
time on the "right" cpu:
%Cpu0 : 0.5 us, 1.2 sy, 0.0 ni, 74.4 id, 22.2 wa, 0.5 hi, 1.3 si, 0.0 st
%Cpu1 : 0.7 us, 1.3 sy, 0.0 ni, 92.3 id, 5.7 wa, 0.1 hi, 0.0 si, 0.0 st
%Cpu2 : 0.5 us, 0.9 sy, 0.0 ni, 95.1 id, 3.3 wa, 0.1 hi, 0.0 si, 0.0 st
%Cpu3 : 0.7 us, 0.9 sy, 0.0 ni, 94.6 id, 3.6 wa, 0.1 hi, 0.0 si, 0.0 st
This point (3) does not apply to 4.15. On 4.15, it seems "dd" naturally
sits on the "right" cpu, so I get the "right" iowait". But if I pin "dd"
to a different cpu, I get the "wrong" iowait again.
I bisected 4.15-4.16. The first "bad" commit was 806486c377e3
"sched/fair: Do not migrate if the prev_cpu is idle"
Â(4) I can get the "right" iowait regardless of which cpu, if I boot
with "nohz=off", or if I suppress nohz by dynamically disabling all
cpuidle states except for state0 (POLL).
This point (4) does not apply to 4.16. On 4.15 and 4.16, suppressing
nohz does not help. (So far, I did not test 4.15).
I bisected 4.16-4.17.ÂÂÂ The first "new" commit was 554c8aa8ecad "sched:
idle: Select idle state before stopping the tick"
Â(5) I seem to get the "right" iowait regardless of which cpu, if I run
inside a virtual machine. I tested stock Fedora v5.1.? inside a KVM
(virt-manager) virtual machine, which also had 4 cpus. Whereas
un-virtualized Fedora v5.1.? on my laptop, behaves as per points 1-4.
I read the documented limitations for "iowait" time. As far as I
understand them, they don't explain such inconsistent values.
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
are several problems:
1. Cpu will not wait for I/O to complete, iowait is the time that a task is
  waiting for I/O to complete. When cpu goes into idle state for
  outstanding task io, another task will be scheduled on this CPU.
2. In a multi-core CPU, the task waiting for I/O to complete is not running
  on any CPU, so the iowait of each CPU is difficult to calculate.
3. The value of iowait field in /proc/stat will decrease in certain
  conditions
Thanks for all the kernels
Alan