Re: [PATCH 3/4 v2] nohz: Fix idle/iowait counts going backwards
From: Peter Zijlstra
Date: Wed May 07 2014 - 15:07:23 EST
On Wed, May 07, 2014 at 08:24:25PM +0200, Denys Vlasenko wrote:
> On 05/07/2014 06:56 PM, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 06:49:47PM +0200, Denys Vlasenko wrote:
> >> On 05/07/2014 04:23 PM, Peter Zijlstra wrote:
> >>> On Wed, May 07, 2014 at 03:41:33PM +0200, Denys Vlasenko wrote:
> >>>> With this change, "iowait-ness" of every idle period is decided
> >>>> at the moment it starts:
> >>>> if this CPU's run-queue had tasks waiting on I/O, then this idle
> >>>> period's duration will be added to iowait_sleeptime.
> >>>>
> >>>> This fixes the bug where iowait and/or idle counts could go backwards,
> >>>> but iowait accounting is not precise (it can show more iowait
> >>>> than there really is).
> >>>>
> >>>
> >>> NAK on this, the thing going backwards is a symptom of the bug, not an
> >>> actual bug itself.
> >>
> >> This patch does fix that bug.
> >
> > Which bug? There are two here:
> >
> > 1) that NOHZ and !NOHZ iowait accounting aren't identical
>
> They can hardly be identical, considering how different these modes are.

They can; we've managed it for pretty much everything else, although it's
not always easy.

And if you look at the patch I sent, that provides the exact moment the
task wakes up, so you can round that to the nearest jiffy boundary and
account appropriately, as if it were accounted on the per-cpu timer tick.

Now, there are likely fun corner cases which need more TLC; see
kernel/sched/proc.c for the fun times we had with the global load avg.
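
A minimal sketch of the rounding idea above, in plain user-space C rather
than kernel code. It assumes a 1 ms tick and a single task blocked on I/O;
the names TICK_NS, round_to_tick() and split_idle() are made up for
illustration and are not kernel interfaces. The wakeup time is rounded to
the nearest tick boundary, as the text suggests, and the idle period is
split there: the part before the cut goes to iowait, the rest to idle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick period: 1 ms, i.e. what HZ=1000 would give. */
#define TICK_NS 1000000ULL

/* Round a timestamp to the nearest tick boundary (ties round up). */
static uint64_t round_to_tick(uint64_t ns)
{
	return ((ns + TICK_NS / 2) / TICK_NS) * TICK_NS;
}

struct idle_split {
	uint64_t iowait_ns;	/* idle time while the task was still blocked */
	uint64_t idle_ns;	/* idle time after the task was woken */
};

/*
 * Split one idle period [idle_start, idle_end] at the rounded wakeup
 * time of the (single, assumed) task that was waiting on I/O.
 */
static struct idle_split split_idle(uint64_t idle_start, uint64_t idle_end,
				    uint64_t io_wake)
{
	struct idle_split s;
	uint64_t cut;

	assert(idle_start <= idle_end);

	cut = round_to_tick(io_wake);
	/* Clamp the cut so it always lies inside the idle period. */
	if (cut < idle_start)
		cut = idle_start;
	if (cut > idle_end)
		cut = idle_end;

	s.iowait_ns = cut - idle_start;
	s.idle_ns = idle_end - cut;
	return s;
}
```

For a 10 ms idle period starting at 0 with a wakeup at 3.4 ms, this charges
3 ms to iowait and 7 ms to idle. Whether nearest-boundary rounding matches
what the tick would have sampled in every corner case is exactly the kind
of detail that would need the extra care mentioned above.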
> And they don't have to be identical, in fact.

Yes, they have to, by definition. CONFIG_NOHZ should make no user-visible
difference (except, of course, the obvious ones: fewer interrupts and,
ideally, lower energy usage).
> > 2) that iowait accounting in general is a steaming pile of crap
>
> If you want to nuke iowait (for example, make its counter constant 0),
> I personally won't object. Can't guarantee others won't...

I won't object to a constant 0, but then we have to do it irrespective
of NOHZ. It isn't necessarily needed, though; I think we can have a
coherent definition of iowait, just most likely not a per-cpu one.

So for UP we have the very simple definition that any idle cycle while
there is a task waiting for io is accounted to iowait.
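
The UP definition can be sketched as a per-tick classification. This is
an illustrative user-space model only; struct up_cputime and
up_account_tick() are hypothetical names, and the counters are in ticks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-system counters for a uniprocessor, in ticks. */
struct up_cputime {
	uint64_t busy;
	uint64_t idle;
	uint64_t iowait;
};

/*
 * Account one tick: an idle tick counts as iowait if at least one task
 * is currently blocked on I/O, and as plain idle otherwise.
 */
static void up_account_tick(struct up_cputime *ct, bool cpu_idle,
			    unsigned int nr_iowait)
{
	assert(ct);

	if (!cpu_idle)
		ct->busy++;
	else if (nr_iowait > 0)
		ct->iowait++;
	else
		ct->idle++;
}
```

With a single CPU there is no ambiguity about which CPU the waiting task
"belongs" to, which is why this definition is simple here and gets hard
once there is more than one CPU.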

This definition can be 'trivially' extended to a global iowait, though
that is expensive to compute.

However, one can argue it's not correct to do that trivial extension,
since if there's only 1 task waiting for io, it could at most tie up 1
CPU's worth of idle time (but very emphatically not a specific cpu).
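
To make the difference between the two global variants concrete, here is
a small sketch that splits the currently idle CPUs into "iowait" and
"idle" for one instant. Both functions and struct global_split are
hypothetical illustrations of the definitions discussed above, not
existing kernel code, and the capped variant is only one possible reading
of "at most 1 CPU's worth per waiting task".

```c
#include <assert.h>

struct global_split {
	unsigned int iowait_cpus;	/* idle CPUs' worth of time charged to iowait */
	unsigned int idle_cpus;		/* idle CPUs' worth of time charged to idle */
};

/* Trivial extension: any waiting task turns all idle time into iowait. */
static struct global_split split_trivial(unsigned int nr_idle_cpus,
					 unsigned int nr_iowait_tasks)
{
	struct global_split s;

	s.iowait_cpus = nr_iowait_tasks ? nr_idle_cpus : 0;
	s.idle_cpus = nr_idle_cpus - s.iowait_cpus;
	assert(s.iowait_cpus + s.idle_cpus == nr_idle_cpus);
	return s;
}

/* Capped variant: each waiting task ties up at most one CPU's worth. */
static struct global_split split_capped(unsigned int nr_idle_cpus,
					unsigned int nr_iowait_tasks)
{
	struct global_split s;

	s.iowait_cpus = nr_iowait_tasks < nr_idle_cpus ?
			nr_iowait_tasks : nr_idle_cpus;
	s.idle_cpus = nr_idle_cpus - s.iowait_cpus;
	assert(s.iowait_cpus + s.idle_cpus == nr_idle_cpus);
	return s;
}
```

With 4 idle CPUs and 1 task waiting for io, the trivial extension charges
all 4 CPUs' worth of idle time to iowait, while the capped variant charges
1 and leaves 3 as idle. Neither says which CPU is the one "in iowait",
which matches the point that the result is global rather than per-cpu.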

So somewhere in that space there is, I think, a viable way to account
iowait, but the straightforward implementation (in as far as the eventual
definition will be straightforward to begin with) will likely be
prohibitively expensive to compute.

Then again, if you look at kernel/sched/proc.c (again) and look at the
bloody mess we had to make for the global load avg accounting to work
for NOHZ, there might (or might just not) be a glimmer of hope we can
pull this off in a scalable manner.

Like I said, 'fun' problem :-)