Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low

From: Alan Jenkins
Date: Fri Jul 05 2019 - 09:37:52 EST

Next message: Nadav Amit: "Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC"
Previous message: Dmitry Vyukov: "Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode"
In reply to: Peter Zijlstra: "Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 05/07/2019 12:38, Peter Zijlstra wrote:

On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:

Hi, scheduler experts!

My cpu "iowait" time appears to be reported incorrectly.Â Do you know why
this could happen?

Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/

Also see the comment near nr_iowait():

/*
* IO-wait accounting, and how its mostly bollocks (on SMP).
*
* The idea behind IO-wait account is to account the idle time that we could
* have spend running if it were not for IO. That is, if we were to improve the
* storage performance, we'd have a proportional reduction in IO-wait time.
*
* This all works nicely on UP, where, when a task blocks on IO, we account
* idle time as IO-wait, because if the storage were faster, it could've been
* running and we'd not be idle.
*
* This has been extended to SMP, by doing the same for each CPU. This however
* is broken.
*
* Imagine for instance the case where two tasks block on one CPU, only the one
* CPU will have IO-wait accounted, while the other has regular idle. Even
* though, if the storage were faster, both could've ran at the same time,
* utilising both CPUs.
*
* This means, that when looking globally, the current IO-wait accounting on
* SMP is a lower bound, by reason of under accounting.
*
* Worse, since the numbers are provided per CPU, they are sometimes
* interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
* associated with any one particular CPU, it can wake to another CPU than it
* blocked on. This means the per CPU IO-wait number is meaningless.
*
* Task CPU affinities can make all that even more 'interesting'.
*/

Thanks. I take those as being different problems, but you mean there is not much demand (or point) to "fix" my issue.

(2) Compare running "dd" with "taskset -c 1":

%Cpu1Â :Â 0.3 us,Â 3.0 sy,Â 0.0 ni, 83.7 id, 12.6 wa,Â 0.0 hi,Â 0.3 si,Â 0.0 st

^ non-zero idle time for Cpu1, despite the pinned IO hog.

The block layer recently decided they could break "disk busy%" reporting for slow devices (mechanical HDD), in order to reduce overheads for fast devices.Â This means the summary view in "atop" now lacks any reliable indicator.

I suppose I need to look in "iotop".

The new /proc/pressure/io seems to have caveats related to the iowait issues... it seems even more complex to interpret for this case, and it does not seem to work how I think it does.[1]

Regards
Alan

[1] https://unix.stackexchange.com/questions/527342/why-does-the-new-linux-pressure-stall-information-for-io-not-show-as-100/

Next message: Nadav Amit: "Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC"
Previous message: Dmitry Vyukov: "Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode"
In reply to: Peter Zijlstra: "Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]