Re: [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats

From: Hidetoshi Seto
Date: Wed Apr 23 2014 - 03:41:22 EST


(2014/04/23 4:45), Peter Zijlstra wrote:
> On Thu, Apr 17, 2014 at 06:41:41PM +0900, Hidetoshi Seto wrote:
>> [TARGET OF THIS PATCH]:
>>
>> Complete rework for iowait accounting implies that some user
>> interfaces might be replaced completely. It will introduce new
>> counter or so, and kill per-cpu iowait counter which is known to
>> being nonsense. It will take long time to be achieved, considering
>> time to make the problem to a public knowledge, and also time for
>> interface transition. Anyway as long as linux try to be reliable OS,
>> such drastic change must be placed on mainline kernel in development
>> sooner or later.
>>
>> OTOH, drastic change like above is not acceptable for conservative
>> environment, such as stable kernels, some distributor's kernel and
>> so on, due to respect for compatibility. Still these kernel expects
>> per-cpu iowait counter works as well as it might have been.
>> Therefore for such environment, applying a simple patch to fix
>> behavior of counter will be appreciated rather than replacing an
>> over-used interface or changing an existing usage/semantics.
>>
>> This patch targets latter kernels mainly. It does not do too much,
>> but intend to fix the idle stats counters to be monotonic. I believe
>> that mainline kernel should pick up this patch too, because it
>> surely fix a hidden bug and does not intercept upcoming drastic
>> change.
>>
>> Again, in summary, this patch does not do drastic change to cope
>> with problem 2. But it just fix behavior of idle/iowait counter of
>> /proc/stats.
>>
>> [WHAT THIS PATCH PROPOSED]:
>>
>> The main reason of the bug is that observers have no idea to
>> determine the delta to be idle or iowait at the first place.
>>
>> So I introduced delayed iowait accounting to allow observers to
>> assign delta based on status of observed cpu at idle entry.
>>
>
> So the problem I have with this is that it makes CONFIG_NOHZ=[ny]
> kernels behave quite differently.

It is not true.
There are already differences before applying my patches.
The behavior of NOHZ=y kernel diverged from original since it was born.

Please note that no one complained about this difference.

I just want to fix a counter not to go backward.
It's a simple bug, isn't it?

> Ideally NOHZ=y and NOHZ=n behave the same, my proposed solution the
> other day does just that. This one OTOH seems to generate entirely
> different results between those kernels.

As you know, behaviors of NOHZ=[ny] are both crap because of per-cpu
iowait accounting.

I guess we should say:
Ideally NOHZ=[ny] behave the same "in the proper way."

What you proposed will do too much to make one nonsense to another
nonsense. It will be unhelpful for people...

> It also doesn't really simplify the code; there's quite a bit of
> complexity introduced to paper over this silly stuff.

Don't complicate things. I want to talk about a small simple bug.

a) today's NOHZ=n
- provides per-cpu iowait counter
(nonsense but still loved by innocent userland)

b) today's NOHZ=y
- provides per-cpu iowait counter
- use it's own idle time accounting different from a)
- have a *bug* that counter might go backward

b') NOHZ=y + my patch
- provides per-cpu iowait counter
- use it's own idle time accounting same as b)
- *bug* in b) have gone
- instead accept gap in iowait value from b)
- "pending" will not bloat more than one iowait span

c) unified something for NOHZ=[ny] (you proposed)
- provides per-cpu iowait counter
- use new accounting different from a) and also b)

d) ideal goal (=not designed & realized yet)
- no per-cpu iowait counter any more
- use new proper accounting different from all of above

I just want to make b) to b') by a patch as small as possible.

What you proposed will make both of a) and b) to c).
I think it does too much and changing a) is not required here.
(from my conservative perspective, patch must be non-intrusive)

Our final goal must be making d) to replace all nasty things
around there. But still there is no idea to do that.
And such big jump will not fit to stable environments.

Again, I just want to fix a small bug here...

I believe my patch is enough simple to do a minimum fix.
Please tell me if you have more simple/better way to fix this
long-standing minor bug, not only for mainline in development
but also for conservative stables like distributor's kernel.


Thanks,
H.Seto

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/