Re: [PATCH] cpuacct: add a branch prediction

From: Paul E. McKenney
Date: Thu Feb 26 2009 - 11:45:30 EST


On Thu, Feb 26, 2009 at 09:06:24PM +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
> >> Peter Zijlstra wrote:
> >> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> >> >
> >> >> Taking hierarchy mutex while reading will make read-side stable.
> >> >
> >> > We're talking about scheduling here; taking a mutex to stop scheduling
> >> > won't work, nor will it be acceptable to use anything that will.
> >> >
> >> No mutex is necessary, anyway.
> >> hierarchy-walker function completely works well under rcu read lock,
> >> if small jitter is allowed.
> >
> > Right, should be doable -- and looking at the code, we have this
> > horrible 32 bit exception in there that locks the rq in order to read
> > the 64bit value.
> >
> > Would be grand to get rid of that. How bad would it be for userspace to
> > get the occasional fubarred value?
> >
> From the viewpoint of user support, if a terribly broken value is reported,
> it will be a user incident and annoy me (us) ;)
>
> I'd like to get rid of rq->lock, too. Hmm... could some routine like
> atomic64_read() help with this? (But I don't want to use atomic_t here.)

atomic64_read() will not help you on a 32-bit machine. Here is the
sequence of events that will cause the aforementioned user incidents and
consequent annoyance:

o The value of the counter is (2^32)-1, or 0xffffffff.

o CPU 0 reads the high-order 32 bits of the counter, getting zero.

o CPU 1 increments the low-order 32 bits of the counter, resulting
  in zero, but notes that there is a carry out of this field.

o CPU 0 reads the low-order 32 bits of the counter, getting zero.

o CPU 1 increments the high-order 32 bits of the counter, so that
  the new value of the counter is 2^32, or 0x100000000.

So CPU 0 gets a value that is -way- off.
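
To make the interleaving concrete, here is a single-threaded replay of
the four events above. It is only a sketch: struct split_counter and
replay_torn_read() are names made up for illustration, not anything in
sched.c, and the "CPUs" are simply statements executed in the order
listed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split counter: two 32-bit halves, as seen on a 32-bit machine. */
struct split_counter {
	uint32_t hi;
	uint32_t lo;
};

/*
 * Replay the sequence of events in one thread and return the value that
 * "CPU 0" assembles from its two separate 32-bit loads.
 */
static uint64_t replay_torn_read(void)
{
	/* The counter starts at (2^32)-1. */
	struct split_counter c = { .hi = 0, .lo = 0xffffffffu };

	uint32_t seen_hi = c.hi;	/* CPU 0 reads the high half: zero */
	c.lo++;				/* CPU 1 increments the low half: wraps to zero */
	uint32_t seen_lo = c.lo;	/* CPU 0 reads the low half: zero */
	c.hi++;				/* CPU 1 propagates the carry */

	/* The counter itself ends up at 2^32, as expected. */
	assert(c.hi == 1 && c.lo == 0);

	return ((uint64_t)seen_hi << 32) | seen_lo;
}
```

The value returned is zero, although the counter was never anywhere
near zero: it went from 0xffffffff straight to 0x100000000.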

The usual trick is something like the following for counter read:

1. Read the high-order 32 bits of the counter.

2. Do a memory barrier, smp_mb().

3. Read the low-order 32 bits of the counter.

4. Do another memory barrier, again smp_mb().

5. Read the high-order 32 bits of the counter again.

   If it is the same as the value obtained in step 1 (or the previous
   execution of step 5), then we are done. (This works even in case
   of complete 64-bit overflow, though we should be very lucky to
   live that long!) Otherwise, go to step 2.
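
The read-side steps above might be transcribed roughly as follows. This
is a userspace sketch under stated assumptions: it uses C11 atomics,
with atomic_thread_fence(memory_order_seq_cst) standing in for
smp_mb(), and struct split_counter and split_counter_read() are
hypothetical names. It follows the steps literally and makes no attempt
to show that they are sufficient against every interleaving with an
updater.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical split counter; not a kernel data structure. */
struct split_counter {
	_Atomic uint32_t hi;
	_Atomic uint32_t lo;
};

static uint64_t split_counter_read(struct split_counter *c)
{
	uint32_t hi, lo, hi_again;

	assert(c);

	/* Step 1: read the high-order 32 bits. */
	hi_again = atomic_load_explicit(&c->hi, memory_order_relaxed);
	do {
		/* On a retry, the value from the previous step 5 is the baseline. */
		hi = hi_again;
		/* Step 2: full barrier (stand-in for smp_mb()). */
		atomic_thread_fence(memory_order_seq_cst);
		/* Step 3: read the low-order 32 bits. */
		lo = atomic_load_explicit(&c->lo, memory_order_relaxed);
		/* Step 4: another full barrier. */
		atomic_thread_fence(memory_order_seq_cst);
		/* Step 5: read the high-order 32 bits again. */
		hi_again = atomic_load_explicit(&c->hi, memory_order_relaxed);
	} while (hi_again != hi);	/* mismatch: go back to step 2 */

	return ((uint64_t)hi << 32) | lo;
}
```

With no concurrent updater the loop body runs exactly once and the two
halves are simply glued together.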

But it is also necessary to modify the counter update:

1. Increment the low-order 32 bits of the counter. If no overflow
   occurred, we are done; otherwise, continue through this sequence
   of steps.

2. Do a memory barrier, smp_mb().

3. Increment the high-order 32 bits of the counter.

How to detect overflow in step 1? Well, if we are incrementing, we can
just test for the new value being zero. Otherwise, if we are adding
a 32-bit number, overflow occurred if the new value of the low-order
32 bits of the counter is less than the old value (but make sure that
the comparison is unsigned!).

This all assumes that you are adding a 32-bit quantity to the counter.
Adding 64-bit values is not much harder.
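
The update steps and the overflow test can be sketched in the same
userspace style. The assumptions are mine, not from the cpuacct code: a
single updater at a time, C11 atomics with
atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb(), and
the hypothetical names struct split_counter and split_counter_add().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical split counter; not a kernel data structure. */
struct split_counter {
	_Atomic uint32_t hi;
	_Atomic uint32_t lo;
};

/*
 * Add a 32-bit delta to the split counter. Assumes only one updater
 * runs at a time. Returns 1 if a carry went into the high half.
 */
static int split_counter_add(struct split_counter *c, uint32_t delta)
{
	uint32_t old_lo, new_lo;

	assert(c);

	old_lo = atomic_load_explicit(&c->lo, memory_order_relaxed);
	new_lo = old_lo + delta;	/* unsigned, so it wraps modulo 2^32 */

	/* Step 1: update the low-order 32 bits. */
	atomic_store_explicit(&c->lo, new_lo, memory_order_relaxed);

	/* Unsigned comparison: the sum wrapped only if it got smaller. */
	if (new_lo >= old_lo)
		return 0;

	/* Step 2: full barrier (stand-in for smp_mb()). */
	atomic_thread_fence(memory_order_seq_cst);

	/* Step 3: propagate the carry into the high-order 32 bits. */
	atomic_fetch_add_explicit(&c->hi, 1, memory_order_relaxed);
	return 1;
}
```

Because both operands are uint32_t, the "less than the old value" test
is an unsigned comparison, which is the point of the warning above. For
a plain increment (delta of one) the test reduces to the new value
being zero.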

Does this approach work for you?

Thanx, Paul

> > But aside from that, the cpu controller itself is also summing directly
> > up the hierarchy, so cpuacct doing the same doesn't seem odd.
> >
> I'll post some idea if I can think of something reasonable.
> But I tend to hesitate to modify sched.c ;)
>
> Thanks,
> -Kame
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/