Re: iowait vs. idle accounting is "inconsistent" - iowait is too low

From: Peter Zijlstra
Date: Fri Jul 05 2019 - 07:38:24 EST


On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
> Hi, scheduler experts!
>
> My CPU "iowait" time appears to be reported incorrectly.  Do you know why
> this could happen?

Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/
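
For reference, that ABI is (among other things) the fifth field of the
"cpu" lines in /proc/stat, in USER_HZ ticks, which top(1) and friends
report as "wa". A minimal userspace sketch that reads the aggregate
counters:

#include <stdio.h>

int main(void)
{
	char cpu[16];
	/* user, nice, system, idle, iowait -- the first five counters */
	unsigned long long v[5];
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;

	/* The first line is the system-wide aggregate; per-CPU "cpuN"
	 * lines follow, which is where the per-CPU misreadings below
	 * come from. */
	if (fscanf(f, "%15s %llu %llu %llu %llu %llu",
		   cpu, &v[0], &v[1], &v[2], &v[3], &v[4]) == 6)
		printf("%s: idle=%llu iowait=%llu ticks\n", cpu, v[3], v[4]);

	fclose(f);
	return 0;
}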

Also see the comment near nr_iowait():

/*
 * IO-wait accounting, and how it's mostly bollocks (on SMP).
 *
 * The idea behind IO-wait accounting is to account the idle time that we
 * could have spent running if it were not for IO. That is, if we were to
 * improve the storage performance, we'd see a proportional reduction in
 * IO-wait time.
 *
 * This all works nicely on UP, where, when a task blocks on IO, we account
 * the idle time as IO-wait, because if the storage were faster, it could've
 * been running and we'd not be idle.
 *
 * This has been extended to SMP by doing the same for each CPU. This,
 * however, is broken.
 *
 * Imagine for instance the case where two tasks block on one CPU; only that
 * one CPU will have IO-wait accounted, while the other has regular idle. Even
 * though, if the storage were faster, both could've run at the same time,
 * utilising both CPUs.
 *
 * This means that, when looking globally, the current IO-wait accounting on
 * SMP is a lower bound, by reason of under-accounting.
 *
 * Worse, since the numbers are provided per CPU, they are sometimes
 * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
 * associated with any one particular CPU; it can wake up on a different CPU
 * than the one it blocked on. This means the per-CPU IO-wait number is
 * meaningless.
 *
 * Task CPU affinities can make all that even more 'interesting'.
 */
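
The two-task example is easy to model. The tick-time decision lives in
account_idle_time() (kernel/sched/cputime.c), which bumps the iowait
counter instead of idle whenever the local runqueue's nr_iowait is
non-zero. A toy userspace simulation of that check (the struct and
idle_tick() below are illustrative stand-ins, not the kernel's):

#include <stdio.h>

/* Toy model of a per-CPU runqueue: just the bits the idle tick looks at. */
struct toy_rq {
	int nr_iowait;		/* tasks that blocked on IO from this CPU */
	unsigned long idle;	/* ticks accounted as idle */
	unsigned long iowait;	/* ticks accounted as iowait */
};

/* Mirrors the check in account_idle_time(): an idle tick becomes
 * iowait only if *this* CPU has blocked IO waiters. */
static void idle_tick(struct toy_rq *rq)
{
	if (rq->nr_iowait > 0)
		rq->iowait++;
	else
		rq->idle++;
}

int main(void)
{
	struct toy_rq cpu[2] = { { 0 } };
	int i;

	/* Both tasks happened to block while last running on CPU 0. */
	cpu[0].nr_iowait = 2;

	/* Both CPUs sit idle for 100 ticks waiting for the IO. */
	for (i = 0; i < 100; i++) {
		idle_tick(&cpu[0]);
		idle_tick(&cpu[1]);
	}

	printf("cpu0: idle=%lu iowait=%lu\n", cpu[0].idle, cpu[0].iowait);
	printf("cpu1: idle=%lu iowait=%lu\n", cpu[1].idle, cpu[1].iowait);
	return 0;
}

This prints iowait=100 for cpu0 and idle=100 for cpu1, even though with
faster storage both tasks could have run concurrently and consumed up to
200 ticks of CPU time; globally, the reported iowait is the lower bound
the comment describes.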