Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
From: Denys Vlasenko
Date: Fri Apr 04 2014 - 13:03:16 EST
On Fri, Apr 4, 2014 at 6:03 PM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
>> However, if we put ourselves in the admin's seat, iowait
>> immediately starts to make sense: for the admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard a guess that admins would settle
>> for the following rules:
>
> Iowait makes sense, but not per CPU. Ultimately it's a global
> stat. Or a per-task one.
There are a lot of situations where admins want to know
how much, on average, their CPUs are idle because
they are waiting for IO.
If you are running, say, a Web cache,
you need that stat in order to be able to
conjecture "looks like I'm IO bound; perhaps caching
some data in RAM will speed things up".
A global stat gives the admin that data. A per-task stat won't:
there can be an efficient Web cache design which uses
many parallel tasks to hide IO latency. Such a
Web cache can be nearly optimal even though its tasks,
individually, each show significant iowait counts.
> I've scratched my head a lot on this. And I think we can't continue
> with the current semantics. If we had to keep the current semantics
> and enforce correctness at the same time, we are going to run into
> big scalability and performance issues. This can't be done without
> locking updates to nr_iowait() with seqlock:
>
> * when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
> increase nr_iowait, save curr CPU number, save time.
>
> * when a task io completes and it gets enqueued on another CPU: retrieve
> old CPU, lock its iowait_seq, decrease nr_iowait, flush delta iowait .
>
> And all that just to maintain stats whose semantics are wrong; this
> would be pure madness.
>
> OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
> matters less). But if we make it a per task stat, we are free to account it
> on the CPU we want.
>
> So what we can do for example is to account it per task and update stats
> on the CPU where the blocking task wakes up. This way we guarantee
> that we only account locally, which is good for scalability.
When an IO-bound task wakes on some CPU,
how exactly do you propose to update the counters -
add the total waited time of this task to this CPU's counter?
Is such a counter meaningful for the admin?
> This is going to be an ABI change on a /proc/stat field semantic.
> We usually can not do that as it can break userspace. But I think
> we have a reasonable exception here:
>
> 1) On a performance POV we don't have the choice.
>
> 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> have ever correctly relied on it.
In the busybox project, the following tools use the iowait counter:
top,mpstat: in order to show "%iowait"
nmeter: to show "waiting for disk" part of CPU bar.
Example:
$ nmeter '%t %70c'
18:57:33 SUU...................................................................
18:57:34 SUUUUUUUUUUI..........................................................
18:57:35 SUUUII................................................................
18:57:36 SUUU..................................................................
18:57:37 SSUDDD................................................................
(^^^^^^ IO-intensive task starts)
18:57:38 SSSSSSUDDDDDDDDDDDDIi.................................................
18:57:39 SSSSSSSSUDDDDDDDDDi...................................................
18:57:40 SSSSSUUDDDDDDDDDDDDi..................................................
18:57:41 SSSSSUUUUUDDDDDDDDDDDDi...............................................
18:57:42 SSSSSUDDDDDDDDDDDDDIi.................................................
18:57:43 SSSUUDDDDDDDi.........................................................
(^^^^^^ IO-intensive task ends)
18:57:44 SUUUI.................................................................
18:57:45 SUUU..................................................................
18:57:46 UU....................................................................
18:57:47 U.....................................................................
This doesn't look bogus to me.
It does give me information I need to know.
> Also it decouples iowait from idle time. Running time is also accounted
> as iowait.
Time when CPUs are busy while there is IO-wait
is usually not a sign of badly tuned software or system.
Only time when CPUs are idle while there is IO-wait is.
That's how it looks from userspace.
--