Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier

From: Paul E. McKenney
Date: Sun Jan 10 2010 - 20:17:14 EST


On Sun, Jan 10, 2010 at 01:24:23PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> > On Sun, Jan 10, 2010 at 06:50:09AM -0500, Steven Rostedt wrote:
> > > On Sat, 2010-01-09 at 21:25 -0800, Paul E. McKenney wrote:
> > > > On Sat, Jan 09, 2010 at 09:12:58PM -0500, Steven Rostedt wrote:
> > >
> > > > > > < user space >
> > > > > >
> > > > > > < misses that CPU 2 is in rcu section >
> > > > >
> > > > >
> > > > > If the TLB flush misses the fact that CPU 2 is running one of the
> > > > > process's threads, and does not flush CPU 2's TLB, it risks the same
> > > > > type of crash.
> > > >
> > > > But isn't the VM's locking helping us out in that case?
> > > >
> > > > > > [CPU 2's ->curr update now visible]
> > > > > >
> > > > > > [CPU 2's rcu_read_lock() store now visible]
> > > > > >
> > > > > > free(obj);
> > > > > >
> > > > > > use_object(obj); <=== crash!
> > > > > >
> > > > >
> > > > > Think about it. If you change a process's mmap, say by flushing out
> > > > > one page of a file mapping and replacing it with another, and the
> > > > > above missed sending an IPI to CPU 2, then CPU 2 may still be
> > > > > accessing the old page of the file rather than the new one.
> > > > >
> > > > > I think this may be the safe bet.
> > > >
> > > > You might well be correct that we can access that bitmap locklessly,
> > > > but there are additional things (like the loading of the arch-specific
> > > > page-table register) that are likely to be helping in the VM case, but
> > > > not necessarily helping in this case.
> > >
> > > Then perhaps the sys_membarrier() should just do a flush_tlb()? That
> > > should guarantee the synchronization, right?
> >
> > Isn't just grabbing the runqueue locks a bit more straightforward and
> > more clearly correct? Again, this is the URCU slowpath, so it is hard
> > to get too excited about a 15% performance penalty.
>
> Even when taking the spinlocks, efficient iteration on active threads is
> done with for_each_cpu(cpu, mm_cpumask(current->mm)), which depends on
> the same cpumask, and thus requires the same memory barriers around the
> updates.

Ouch!!! Good point and good catch!!!
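
To make the ordering problem concrete, here is roughly the iteration
we have been discussing. This is only a sketch, not the actual patch:
ipi_noop_fn() is a placeholder, CPU-hotplug exclusion is omitted for
brevity, and the smp_mb() placement is exactly the part in question:

#include <linux/smp.h>
#include <linux/sched.h>
#include <linux/cpumask.h>

static void ipi_noop_fn(void *info)
{
	/* Entering and leaving the IPI handler implies the needed
	 * ordering on the target CPU. */
}

static void membarrier_fast_sketch(void)
{
	int cpu;

	/* Order the caller's prior accesses before the mask read... */
	smp_mb();
	for_each_cpu(cpu, mm_cpumask(current->mm))
		smp_call_function_single(cpu, ipi_noop_fn, NULL, 1);
	/* ...and the mask read before the caller's later accesses. */
	smp_mb();
}

The hazard being that the scheduler's mm_cpumask() update at context
switch might not be ordered against the user-space accesses, so the
loop can miss a CPU that has just switched to one of our threads.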

> We could switch to an inefficient iteration over all online CPUs instead,
> and read each runqueue's ->mm with the spinlock held. Is that what you
> propose? This will cause reading of large amounts of runqueue
> information, especially on large systems running few threads. The other
> way around is to iterate over all the process's threads: in this case,
> small systems running many threads will have to read information about
> many inactive threads, which is not much better.

I am not all that worried about exactly what we do as long as it is
pretty obviously correct. We can then improve performance when and as
the need arises. We might need to use any of the strategies you
propose, or perhaps even choose among them depending on the number of
threads in the process, the number of CPUs, and so forth. (I hope not,
but...)

My guess is that an obviously correct approach would work well for a
slowpath. If someone later runs into performance problems, we can fix
them with the added knowledge of what they are trying to do.
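
For concreteness, here is the sort of thing I have in mind. The
runqueue internals live in kernel/sched.c, so rq_lock(), rq_unlock(),
and cpu_curr_mm() below are made-up stand-ins for whatever the
scheduler would need to export, and ipi_noop_fn() is the placeholder
from the earlier sketch:

static void membarrier_slow_sketch(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct mm_struct *mm;

		rq_lock(cpu);		/* hypothetical helper */
		mm = cpu_curr_mm(cpu);	/* hypothetical: rq->curr->mm */
		rq_unlock(cpu);		/* hypothetical helper */

		/* Send the IPI only after dropping the lock: waiting
		 * for an IPI while holding a runqueue lock is asking
		 * for deadlock. */
		if (mm == current->mm)
			smp_call_function_single(cpu, ipi_noop_fn, NULL, 1);
	}
}

The runqueue lock acquisition and release supply the ordering against
the scheduler's ->curr update, so no extra barriers around a cpumask
read are needed. Inefficient on large systems running few threads, as
you say, but obviously correct, which is what I am after in a
slowpath.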

Thanx, Paul
