Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

From: Denys Vlasenko
Date: Wed Apr 02 2014 - 15:36:29 EST

On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
<seto.hidetoshi@xxxxxxxxxxxxxx> wrote:
> There are 2 problems:
> [PROBLEM 1]: there is no exclusive control.
> It is easy to understand that there are 2 different cpu - an
> observing cpu where running a program observing idle cpu's stat
> and an observed cpu where performing idle. It means race can
> happen if observed cpu wakes up while it is observed. Observer
> can accidentally add calculated duration time (say delta) to
> time stats which is just updated by woken cpu. Soon correct
> idle time is returned in next turn, so it will result in
> backward time. Therefore readers must be excluded by writer.
> Then time stats are updated by woken cpu so there are only one
> writer, right? No, unfortunately no. I'm not sure why are they,
> but in some reason the function get_cpu_{idle,iowait}_time_us()
> has ability to update sleeping cpu's time stats from observing
> cpu. From grep of today's kernel tree, this feature is used by
> cpufreq module. Calling this function with this feature in
> periodically manner works like emulating tick for sleeping cpu.

Frederic's patches started by moving all updates
to tick_nohz_stop_idle(), makign the above problem easier -
get_cpu_{idle,iowait}_time_us() are pure readers.

The patches are here:

> [PROBLEM 2]: broken iowait accounting.
> As historical nature, cpu's idle time was accounted as either
> idle or iowait depending on the presence of tasks blocked by
> I/O. No one complain about it for a long time. However:
> > Still trying to wrap my head around it, but conceptually
> > get_cpu_iowait_time_us() doesn't make any kind of sense.
> > iowait isn't per cpu since effectively tasks that aren't
> > running aren't assigned a cpu (as Oleg already pointed out).
> -- Peter Zijlstra
> Now some kernel folks realized that accounting iowait as per-cpu
> does not make sense in SMP world. When we were in traditional
> UP era, cpu is considered to be waiting I/O if it is idle while
> nr_iowait > 0. But in these days with SMP systems, tasks easily
> migrate from a cpu where they issued an I/O to another cpu where
> they are queued after I/O completion.

However, if we would put ourselves into admin's seat, iowait
immediately starts to make sense: for admin, the system state
where a lot of CPU time is genuinely idle is qualitatively different
form the state where a lot of CPU time is "idle" because
we are I/O bound.

Admins probably wouldn't insist that iowait accounting must be
very accurate. I would hazard to guess that admins would settle
for the following rules:

* (idle + iowait) should accurately represent amount of time
CPUs were idle.
* both idle and iowait should never go backwards
* when system is truly idle, only idle should increase
* when system is truly I/O bound on all CPUs, only iowait should increase
* when the situation is in between of the above two cases,
both iowait and idle counters should grow. It's ok if they
represent idle/IO-bound ratio only approximately

> Back to NO_HZ mechanism. Totally terrible thing here is that
> observer need to handle duration "delta" without knowing that
> nr_iowait of sleeping cpu can be changed easily by migration
> even if cpu is sleeping.

How about the following: when CPU enters idle, it remembers
in struct tick_sched->idle_active whether it was "truly" idle
or I/O bound: something like

ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;

Then, when we exit idle, we account entire idle period as
"true" idle or as iowait depending on ts->idle_active value,
regardless of what happened to I/O bound task (whether
it migrated or not).

> [WHAT THIS PATCH PROPOSED]: fix problem 1 first.
> To fix problem 1, this patch adds seqlock for NO_HZ idle
> accounting to avoid possible races between multiple reader/writer.

That's what Frederic proposed too.
However, he thought it adds too much overhead. It adds
a new field, two memory barriers and a RMW cycle
in the updater (tick_nohz_stop_idle),
and similarly two memory barriers in readers too.

How about a slightly different approach.
We take his first two patches:

"nohz: Convert a few places to use local per cpu accesses"
"nohz: Only update sleeptime stats locally"

then on top of that we fix racy access by readers as follows:

updater does not need ts->idle_entrytime field
after it is done. We can reuse it as "updater in progress" flag.
We set it to a sentinel value (say, zero),
then increase idle or iowait, then clear ts->idle_active.
With two memory barriers to ensure other CPUs see
updates exactly in that order:

tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now):
/* Updates the per cpu time idle statistics counters */
delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->idle_entrytime.tv64 = 0;
+ smp_wmb();
+ if (ts->idle_active == 2)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ smp_wmb();
ts->idle_active = 0;

Compared to seqlock approach, we don't need a new field (seqlock counter)
and don't need to increment it twice (two RMW cycles). We have the same
number of wmb's as seqlock approach has - two.

The iowait reader reads these three fields in reverse order,
with correct read barriers:

get_cpu_iowait_time_us(int cpu, u64 *last_update_time):
+ again:
+ active = ACCESS_ONCE(ts->idle_active);
+ smp_rmb();
+ count = ACCESS_ONCE(ts->iowait_sleeptime);
+ if (active == 2) { // 2 means "idle was entered with pending I/O"
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
+ }
+ return count;

Compared to seqlock approach, we can avoid second rmb (as shown above)
if we see that idle_active was not set: we know that in this case,
we already have the result in "count", no need to correct it.

Now, if idle_active was set (for iowait case, to 2). We can check the
sentinel in idle_entrytime. If it is there, we are racing with updater.
in this case, we loop back:

+ if (active == 2) {
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
+ if (start.tv64 == 0)
+ /* Other CPU is updating the count.
+ * We don't know whether fetched count is valid.
+ */
+ goto again;
+ delta = ktime_sub(now, start);
+ count = ktime_add(count, delta);

If we don't see sentinel, we adjust "count" and we are done.

I will send the full patches shortly.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at