Re: [RFC PATCH] introduce sys_membarrier(): process-wide memorybarrier (v5)

From: Steven Rostedt
Date: Tue Jan 19 2010 - 12:31:00 EST

On Tue, 2010-01-19 at 17:47 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:

> > It's this scenario that is causing problem. Let's consider this
> > execution:
> >
> > CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> > <kernel-space> <kernel-space>
> > switch_mm()
> > smp_mb()
> > clear_mm_cpumask()
> > set_mm_cpumask()
> > smp_mb() (by load_cr3() on x86)
> > switch_to()
> > mm_cpumask includes CPU 1
> > rcu_read_lock()
> > if (CPU 1 mm != our mm)
> > skip CPU 1.
> > rcu_read_unlock()
> > current = next (1)
> OK, so on x86 current uses esp and will be flipped somewhere in the
> switch_to() magic, cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
> will be set before context_switch() and that always implies a mb() for
> non matching ->mm's [*]

This explanation by Mathieu still does not show the issue. He finally
explained it correctly here:

The issue is before switch_to. If the read side reads in a pointer but
gets preempted and goes to the kernel, and the kernel updates next
(before the switch_to), and the write side does sys_membarrier(), the:

if (cpu_curr(1)->mm != our mm)

will fail, so no memory barrier will happen yet.

Then the userspace side of the writer will free or change the data, but
the reader side read is still stuck in the bus (still before the
switch_to). Then the reader can get a bad pointer.

Here's a simple explanation:

---------- -----------
<userspace> <userspace>
obj = get_obj();
obj = get_obj(); <load of obj>
<obj still being retrieved from memory>
<interrupt, schedule out>
rq->curr = next;
<bus stall>
<obj still in memory>

<kernel space>
if (curr_cpu(1)->mm != mm) <-- false

modify object

<obj finally is loaded>
<schedule kernel thread>
<schedule task back>
refer(obj) <-- corruption!

> > <switch back to user-space>
> > read-lock()
> > read gp, store local gp
> > barrier()
> > access critical section (2)
> >
> > So if we don't have any memory barrier between (1) and (2), the memory
> > operations can be reordered in such a way that CPU 0 will not send IPI
> > to a CPU that would need to have it's barrier() promoted into a
> > smp_mb().
> OK, so I'm utterly failing to make sense of the above, do you need more
> than the 2 cpus discussed to make it go boom?

Just 2 cpu's as explained above.

> > Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> > when the scheduler runs concurrently on another CPU, _all_ the scheduling
> > code is executed atomically wrt the spin lock taken on cpu 0.
> Sure, but taking the rq->lock is fairly heavy handed.

Yes, but for now it seems to be the only safe way.

> > When x86 uses iret to return to user-space, then we have a serializing
> > instruction. But if it uses sysexit, or if we are on a different
> > architecture, are we sure that a memory barrier is issued before
> > returning to user-space ?
> [*] and possibly also for matching ->mm's, because:
> OK, so I had a quick look at the switch_to() magic, and from what I can
> make of it it implies an mb, if only because poking at the segment
> registers implies LOCK semantics.

The problem is that this issue can be caused before we get to

-- Steve

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at