Re: [PATCH v17 1/2] sys_membarrier(): system-wide memory barrier (generic, x86)

From: josh
Date: Tue May 05 2015 - 19:11:51 EST


On Tue, May 05, 2015 at 06:25:12PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > On Mon, May 04, 2015 at 05:00:12PM -0400, Mathieu Desnoyers wrote:
> > > * Benchmarks
> > >
> > > On Intel Xeon E5405 (8 cores)
> > > (one thread is calling sys_membarrier, the other 7 threads are busy
> > > looping)
> > >
> > > 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
> > >
> > > * User-space user of this system call: Userspace RCU library
> > >
> > > Both the signal-based and the sys_membarrier userspace RCU schemes
> > > permit us to remove the memory barrier from the userspace RCU
> > > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > > accelerating them. These memory barriers are replaced by compiler
> > > barriers on the read-side, and all matching memory barriers on the
> > > write-side are turned into an invocation of a memory barrier on all
> > > active threads in the process. By letting the kernel perform this
> > > synchronization rather than dumbly sending a signal to every thread in
> > > the process (as we currently do), we diminish the number of unnecessary
> > > wake-ups and only issue the memory barriers on active threads. Non-running
> > > threads do not need to execute such barriers anyway, because these are
> > > implied by the scheduler context switches.
> > >
> > > Results in liburcu:
> > >
> > > Operations in 10s, 6 readers, 2 writers:
> > >
> > > memory barriers in reader:    1701557485 reads, 3129842 writes
> > > signal-based scheme:          9825306874 reads,    5386 writes
> > > sys_membarrier:               7992076602 reads,     220 writes
> > >
> > > The dynamic sys_membarrier availability check adds some overhead to
> > > the read-side compared to the signal-based scheme, but besides that,
> > > with the expedited scheme, we can see that we are close to the read-side
> > > performance of the signal-based scheme. However, this non-expedited
> > > sys_membarrier implementation has a much slower grace period than signal
> > > and memory barrier schemes.
> > >
> > > An expedited version of this system call can be added later on to speed
> > > up the grace period. Its implementation will likely depend on reading
> > > the cpu_curr()->mm without holding each CPU's rq lock.
> >
> > So, I realize that there's a lot of history tied up in the previous 16
> > versions and associated mail threads. However, can you please summarize
> > in the commit message what the benefit of merging this version is?
> > Because from the text above, from liburcu's perspective, it appears to
> > be strictly worse in performance than the signal-based scheme.
> >
> > There are other non-performance reasons why it might make sense to
> > include this; for instance, signals don't play nice with libraries, with
> > other processes you might inject yourself into for tracing purposes, or
> > with general sanity. However, the explanation for those use cases and
> > how membarrier() improves them needs to go in the commit message, rather
> > than only in the collective memory and mail archives of people who have
> > discussed this patch series.
> >
> > (My apologies if the explanation is in the commit message and
> > I've just missed it.)
>
> I will add info about signals vs libraries, which appears to be missing
> from the commit message:
>
> "Besides diminishing the number of wake-ups, one major advantage of the
> membarrier system call over the signal-based scheme is that it does not
> need to reserve a signal. This plays much more nicely with libraries,
> and with processes injected into for tracing purposes, for which we
> cannot expect that signals will be unused by the application."
>
> The commit message already points out that sys_membarrier diminishes the
> number of unnecessary wake-ups sent to other threads compared to the
> signal-based approach.
>
> I re-ran those tests on urcu master branch with a slightly modified
> version of the sys_membarrier scheme too: a version which assumes that
> sys_membarrier is always available. We can then compare apples to
> apples performance-wise between signal and membarrier approaches:
>
> Results in liburcu:
>
> Operations in 10s, 6 readers, 2 writers:
>
> memory barriers in reader:    1701557485 reads, 3129842 writes
> signal-based scheme:          9830061167 reads,    6700 writes
> sys_membarrier:               9952759104 reads,     425 writes
> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>
> It shows that sys_membarrier read-side actually performs slightly
> better than the signal-based scheme, in the absence of dynamic
> check for syscall availability. This could be enhanced in userspace
> eventually if we decide to implement self-modifying code upon
> feature detection in liburcu. I'll update the commit message with
> this new table.

That's *much* better, thank you.

- Josh Triplett
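
As a rough cross-check of the updated table quoted above, the read-side
ratios can be recomputed from the posted counts. This is only a sketch: the
numbers are copied from the thread, while the dictionary layout and the
helper names (relative_reads, writes_per_second) are made up for
illustration and are not part of liburcu or the patch.

```python
# Counts copied from the "Operations in 10s, 6 readers, 2 writers" table
# in the quoted reply. The keys are just labels chosen for this sketch.
DURATION_S = 10

READS = {
    "memory barriers in reader": 1701557485,
    "signal-based scheme": 9830061167,
    "sys_membarrier": 9952759104,
    "sys_membarrier (dyn. check)": 7970328887,
}

WRITES = {
    "memory barriers in reader": 3129842,
    "signal-based scheme": 6700,
    "sys_membarrier": 425,
    "sys_membarrier (dyn. check)": 425,
}


def relative_reads(scheme, baseline="signal-based scheme"):
    """Read count of `scheme` divided by the read count of `baseline`."""
    return READS[scheme] / READS[baseline]


def writes_per_second(scheme):
    """Average write-side operations per second over the 10s run."""
    return WRITES[scheme] / DURATION_S


if __name__ == "__main__":
    for scheme in READS:
        print(f"{scheme:30s} "
              f"{relative_reads(scheme):6.3f}x reads vs signal, "
              f"{writes_per_second(scheme):10.1f} writes/s")
```

With these figures, sys_membarrier without the dynamic check lands about 1.2%
above the signal-based scheme on reads, and the dynamically checked variant
about 19% below it. The write side stays far slower than either alternative,
which is the slow grace period the non-expedited implementation is already
noted to have.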