Re: [RFC PATCH] introduce sys_membarrier(): process-wide memorybarrier (v5)

From: Peter Zijlstra
Date: Tue Jan 19 2010 - 11:49:23 EST


On Thu, 2010-01-14 at 11:26 -0500, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@xxxxxxxxxxxxx) wrote:
> > On Wed, 2010-01-13 at 14:36 -0500, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@xxxxxxxxxxxxx) wrote:
> > > > On Tue, 2010-01-12 at 20:37 -0500, Mathieu Desnoyers wrote:
> > > > > + for_each_cpu(cpu, tmpmask) {
> > > > > + spin_lock_irq(&cpu_rq(cpu)->lock);
> > > > > + mm = cpu_curr(cpu)->mm;
> > > > > + spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > > > + if (current->mm != mm)
> > > > > + cpumask_clear_cpu(cpu, tmpmask);
> > > > > + }
> > > >
> > > > Why not:
> > > >
> > > > rcu_read_lock();
> > > > if (current->mm != cpu_curr(cpu)->mm)
> > > > cpumask_clear_cpu(cpu, tmpmask);
> > > > rcu_read_unlock();
> > > >
> > > > the RCU read lock ensures the task_struct obtained remains valid, and it
> > > > avoids taking the rq->lock.
> > > >
> > >
> > > If we go for a simple rcu_read_lock, I think that we need a smp_mb()
> > > after switch_to() updates the current task on the remote CPU, before it
> > > returns to user-space. Do we have this guarantee for all architectures ?
> > >
> > > So what I'm looking for, overall, is:
> > >
> > > schedule()
> > > ...
> > > switch_mm()
> > > smp_mb()
> > > clear mm_cpumask
> > > set mm_cpumask
> > > switch_to()
> > > update current task
> > > smp_mb()
> > >
> > > If we have that, then the rcu_read_lock should work.
> > >
> > > What the rq lock currently gives us is the guarantee that if the current
> > > thread changes on a remote CPU while we are not holding this lock, then
> > > a full scheduler execution is performed, which implies a memory barrier
> > > if we change the current thread (it does, right ?).
> >
> > I'm not quite seeing it, we have 4 possibilities, switches between
> > threads with:
> >
> > a) our mm, another mm
> >
> > - if we observe the former, we'll send an IPI (redundant)
> > - if we observe the latter, the switch_mm will have issued an mb
> >
> > b) another mm, our mm
> >
> > - if we observe the former, we're good because the cpu didn't run our
> > thread when we called sys_membarrier()
> > - if we observe the latter, we'll send an IPI (redundant)
>
> It's this scenario that is causing problem. Let's consider this
> execution:
>
> CPU 0 (membarrier) CPU 1 (another mm -> our mm)
> <kernel-space> <kernel-space>
> switch_mm()
> smp_mb()
> clear_mm_cpumask()
> set_mm_cpumask()
> smp_mb() (by load_cr3() on x86)
> switch_to()
> mm_cpumask includes CPU 1
> rcu_read_lock()
> if (CPU 1 mm != our mm)
> skip CPU 1.
> rcu_read_unlock()
> current = next (1)

OK, so on x86 current uses esp and will be flipped somewhere in the
switch_to() magic, cpu_curr(cpu) as used by CPU 0 uses rq->curr, which
will be set before context_switch() and that always implies a mb() for
non matching ->mm's [*]

> <switch back to user-space>
> read-lock()
> read gp, store local gp
> barrier()
> access critical section (2)
>
> So if we don't have any memory barrier between (1) and (2), the memory
> operations can be reordered in such a way that CPU 0 will not send IPI
> to a CPU that would need to have it's barrier() promoted into a
> smp_mb().

OK, so I'm utterly failing to make sense of the above, do you need more
than the 2 cpus discussed to make it go boom?

> Replacing these kernel rcu_read_lock/unlock() by rq locks ensures that
> when the scheduler runs concurrently on another CPU, _all_ the scheduling
> code is executed atomically wrt the spin lock taken on cpu 0.

Sure, but taking the rq->lock is fairly heavy handed.

> When x86 uses iret to return to user-space, then we have a serializing
> instruction. But if it uses sysexit, or if we are on a different
> architecture, are we sure that a memory barrier is issued before
> returning to user-space ?

[*] and possibly also for matching ->mm's, because:

OK, so I had a quick look at the switch_to() magic, and from what I can
make of it it implies an mb, if only because poking at the segment
registers implies LOCK semantics.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/