Re: [PATCH 02/22 -v7] Add basic support for gcc profilerinstrumentation

From: Mathieu Desnoyers
Date: Mon Feb 04 2008 - 17:52:00 EST


* Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
>
> On Mon, 4 Feb 2008, Paul E. McKenney wrote:
> > OK, will see what I can do...
> >
> > > On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> > >
> > > > Yep, you have dependencies, so something like the following:
> > > >
> > > > initial state:
> > > >
> > > > struct foo {
> > > > int a;
> > > > };
> > > > struct foo x = { 0 };
> > > > struct foo y = { 0 };
> > > > struct foo *global_p = &y;
> > > > /* other variables are appropriately declared auto variables */
> > > >
> > > > /* No kmalloc() or kfree(), hence no RCU grace periods. */
> > > > /* In the terminology of http://lwn.net/Articles/262464/, we */
> > > > /* are doing only publish-subscribe, nothing else. */
> > > >
> > > > writer:
> > > >
> > > > x.a = 1;
> > > > smp_wmb(); /* or smp_mb() */
> > > > global_p = &x;
> > > >
> > > > reader:
> > > >
> > > > p = global_p;
> > > > ta = p->a;
> > > >
> > > > Both Alpha and aggressive compiler optimizations can result in the reader
> > > > seeing the new value of the pointer (&x) but the old value of the field
> > > > (0). Strange but true. The fix is as follows:
> > > >
> > > > reader:
> > > >
> > > > p = global_p;
> > > > smp_read_barrier_depends(); /* or use rcu_dereference() */
> > > > ta = p->a;
> > > >
> > > > So how can this happen? First note that if smp_read_barrier_depends()
> > > > was unnecessary in this case, it would be unnecessary in all cases.
> > > >
> > > > Second, let's start with the compiler. Suppose that a highly optimizing
> > > > compiler notices that in almost all cases, the reader finds p==global_p.
> > > > Suppose that this compiler also notices that one of the registers (say
> > > > r1) almost always contains this expected value of global_p, and that
> > > > cache pressure ensures that an actual load from global_p almost always
> > > > generates an expensive cache miss. Such a compiler would be within its
> > > > rights (as defined by the C standard) to generate code assuming that r1
> > > > already had the right value, while also generating code to validate this
> > > > assumption, perhaps as follows:
> > > >
> > > > r2 = global_p; /* high latency, other things complete meanwhile */
> > > > ta == r1->a;
> > > > if (r1 != r2)
> > > > ta = r2->a;
> > > >
> > > > Now consider the following sequence of events on a superscalar CPU:
> > >
> > > I think you missed one step here (causing my confusion). I don't want to
> > > assume so I'll try to put in the missing step:
> > >
> > > writer: r1 = p; /* happens to use r1 to store parameter p */
> >
> > You lost me on this one... The writer has only the following three steps:
>
> You're right. I meant "writer: r1 = x;"
>
> >
> > writer:
> >
> > x.a = 1;
> > smp_wmb(); /* or smp_mb() */
> > global_p = &x;
> >
> > Where did the "r1 = p" come from? For that matter, where did "p" come
> > from?
> >
> > > > reader: r2 = global_p; /* issued, has not yet completed. */
> > > > reader: ta = r1->a; /* which gives zero. */
> > > > writer: x.a = 1;
> > > > writer: smp_wmb();
> > > > writer: global_p = &x;
> > > > reader: r2 = global_p; /* this instruction now completes */
> > > > reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> > >
> > > Is that the case?
> >
> > Ah! Please note that I am doing something unusual here in that I am
> > working with global variables, as opposed to the normal RCU practice of
> > dynamically allocating memory. So "x" is just a global struct, not a
> > pointer to a struct.
> >
>
> But lets look at a simple version of my original code anyway ;-)
>
> Writer:
>
> void add_op(struct myops *x) {
> /* x->next may be garbage here */
> x->next = global_p;
> smp_wmb();
> global_p = x;
> }
>
> Reader:
>
> void read_op(void)
> {
> struct myops *p = global_p;
>
> while (p != NULL) {
> p->func();
> p = next;
> /* if p->next is garbage we crash */
> }
> }
>
>
> Here, we are missing the read_barrier_depends(). Lets look at the Alpha
> cache issue:
>
>
> reader reads the new version of global_p, and then reads the next
> pointer. But since the next pointer is on a different cacheline than
> global_p, it may have somehow had that in it's cache still. So it uses the
> old next pointer which contains the garbage.
>
> Is that correct?
>
> But I will have to admit, that I can't see how an aggressive compiler
> might have screwed this up. Being that x is a parameter, and the function
> add_op is not in a header file.
>

Tell me if I am mistakened, but applying Paul's explanation to your
example would give (I unroll the loop for clarity) :

Writer:

void add_op(struct myops *x) {
/* x->next may be garbage here */
x->next = global_p;
smp_wmb();
global_p = x;
}

Reader:

void read_op(void)
{
struct myops *p = global_p;

if (p != NULL) {
p->func();
p = p->next;
/*
* Suppose the compiler expects that p->next is likely to be equal to
* p + sizeof(struct myops), uses r1 to store previous p, r2 to store the
* next p and r3 to store the expected value. Let's look at what the
* compiler could do for the next loop iteration.
*/
r2 = r1->next (1)
r3 = r1 + sizeof(struct myops)
r4 = r3->func (2)
if (r3 == r2 && r3 != NULL)
call r4

/* if p->next is garbage we crash */
} else
return;

if (p != NULL) {
p->func();
p = p->next;
/* if p->next is garbage we crash */
} else
return;
.....
}

In this example, we would be reading the expected "r3->func" (2) before
reading the real r1->next (1) value if reads are issued out of order.

Paul, am I correct ? And.. does the specific loop optimization I just
described actually exist ?

Thanks for your enlightenment :)

Mathieu

> -- Steve
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/