Re: [PATCH 2/4] arch: Add lightweight memory barriers fast_rmb() and fast_wmb()

From: Paul E. McKenney
Date: Mon Nov 17 2014 - 18:55:05 EST


On Mon, Nov 17, 2014 at 01:11:57PM -0800, Alexander Duyck wrote:
> On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
> >On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> >>There are a number of situations where the mandatory barriers rmb() and
> >>wmb() are used to order memory/memory operations in the device drivers
> >>and those barriers are much heavier than they actually need to be. For
> >>example in the case of PowerPC wmb() calls the heavy-weight sync
> >>instruction when for memory/memory operations all that is really needed is
> >>an lsync or eieio instruction.
> >
> >Is this still the case if one of the memory operations is MMIO? Last
> >I knew, it was not.
>
> This barrier is not meant for use in MMIO operations, for that you
> still need a full barrier as I call out in the documentation
> section. What the barrier does is allow for a lightweight barrier
> for accesses to coherent system memory. So for example many device
> drivers have to perform a read of the descriptor to see if the
> device is done with it. We need an rmb() following that check to
> prevent any other accesses.
>
> Right now on x86 that rmb() becomes an lfence instruction and is
> quite expensive, and as it turns out we don't need it since the x86
> doesn't reorder reads. The same kind of thing applies to PowerPC,
> only in that case we use a sync when what we really need is a
> lwsync.

Would it make sense to have a memory barrier that enforced the
non-store-buffer orderings, that is prior reads before later
reads and writes and prior writes before later writes? This was
discussed earlier this year ((http://lwn.net/Articles/586838/,
https://lwn.net/Articles/588300/). If I recall correctly, one of
the biggest obstacles was the name. ;-)

> >>This commit adds a fast (and loose) version of the mandatory memory
> >>barriers rmb() and wmb(). The prefix to the name is actually based on the
> >>version of the functions that already exist in the mips and tile trees.
> >>However I thought it applicable since it gets at what we are trying to
> >>accomplish with these barriers and somethat implies their risky nature.
> >>
> >>These new barriers are not as safe as the standard rmb() and wmb().
> >>Specifically they do not guarantee ordering between cache-enabled and
> >>cache-inhibited memories. The primary use case for these would be to
> >>enforce ordering of memory reads/writes when accessing cache-enabled memory
> >>that is shared between the CPU and a device.
> >>
> >>It may also be noted that there is no fast_mb(). This is due to the fact
> >>that most architectures didn't seem to have a good way to do a full memory
> >>barrier quickly and so they usually resorted to an mb() for their smp_mb
> >>call. As such there no point in adding a fast_mb() function if it is going
> >>to map to mb() for all architectures anyway.
> >
> >I must confess that I still don't entirely understand the motivation.
>
> The motivation is to provide finer grained barriers. So this
> provides an in-between that allows us to "choose the right hammer".
> In the case of PowerPC it is the difference between sync/lwsync, on
> ARM it is dsb()/dmb(), and on x86 it is lfence/barrier().

Ah, so ARM will motivate a fast_wmb(), given its instruction set.

> >Some problems in PowerPC barrier.h called out below.
> >
> > Thanx, Paul
> >
>
> <snip>
>
> >>diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> >>index 22a969c..fe55dea 100644
> >>--- a/Documentation/memory-barriers.txt
> >>+++ b/Documentation/memory-barriers.txt
> >>@@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> >> operations" subsection for information on where to use these.
> >>
> >>
> >>+ (*) fast_wmb();
> >>+ (*) fast_rmb();
> >>+
> >>+ These are for use with memory based device I/O to guarantee the ordering
> >>+ of cache-enabled writes or reads with respect to other writes or reads
> >>+ to cache-enabled memory.
> >>+
> >>+ For example, consider a device driver that shares memory with a device
> >>+ and uses a descriptor status value to indicate if the descriptor belongs
> >>+ to the device or the CPU, and a doorbell to notify it when new
> >>+ descriptors are available:
> >>+
> >>+ if (desc->status != DEVICE_OWN) {
> >>+ /* do not read data until we own descriptor */
> >>+ fast_rmb();
> >>+
> >>+ /* read/modify data */
> >>+ read_data = desc->data;
> >>+ desc->data = write_data;
> >>+
> >>+ /* flush modifications before status update */
> >>+ fast_wmb();
> >>+
> >>+ /* assign ownership */
> >>+ desc->status = DEVICE_OWN;
> >>+
> >>+ /* force memory to sync before notifying device via MMIO */
> >>+ wmb();
> >>+
> >>+ /* notify device of new descriptors */
> >>+ writel(DESC_NOTIFY, doorbell);
> >>+ }
> >>+
> >>+ The fast_rmb() allows us guarantee the device has released ownership
> >>+ before we read the data from the descriptor, and he fast_wmb() allows
> >>+ us to guarantee the data is written to the descriptor before the device
> >>+ can see it now has ownership. The wmb() is needed to guarantee that the
> >>+ cache-enabled memory writes have completed before attempting a write to
> >>+ the cache-inhibited MMIO region.
> >>+
> >>+
> >> MMIO WRITE BARRIER
> >> ------------------
>
> The general idea is that the device/CPU follow acquire/release style
> semantics and we need the lightweight barriers to enforce ordering
> to prevent us from accessing the descriptors during those periods
> where we do not own them. As the example shows we still need a full
> barrier when going between MMIO and standard memory. Hopefully that
> is what is conveyed in the documentation bits I have above.

Thank you for the clarification.

> <snip>
>
> >>diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> >>index cb6d66c..f480097 100644
> >>--- a/arch/powerpc/include/asm/barrier.h
> >>+++ b/arch/powerpc/include/asm/barrier.h
> >>@@ -36,22 +36,20 @@
> >>
> >> #define set_mb(var, value) do { var = value; mb(); } while (0)
> >>
> >>-#ifdef CONFIG_SMP
> >>-
> >> #ifdef __SUBARCH_HAS_LWSYNC
> >> # define SMPWMB LWSYNC
> >> #else
> >> # define SMPWMB eieio
> >> #endif
> >>
> >>-#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>
> >>+#ifdef CONFIG_SMP
> >> #define smp_mb() mb()
> >>-#define smp_rmb() __lwsync()
> >>-#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>+#define smp_rmb() fast_rmb()
> >>+#define smp_wmb() fast_wmb()
> >> #else
> >>-#define __lwsync() barrier()
> >>-
> >> #define smp_mb() barrier()
> >> #define smp_rmb() barrier()
> >> #define smp_wmb() barrier()
> >>@@ -69,10 +67,16 @@
> >> #define data_barrier(x) \
> >> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> >>
> >>+/*
> >>+ * The use of smp_rmb() in these functions are actually meant to map from
> >>+ * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
> >>+ * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> >>+ * map to the more heavy-weight sync.
> >>+ */
> >> #define smp_store_release(p, v) \
> >> do { \
> >> compiletime_assert_atomic_type(*p); \
> >>- __lwsync(); \
> >>+ smp_rmb(); \
> >
> >This is not good at all. For smp_store_release(), we absolutely
> >must order prior loads and stores against the assignment on the following
> >line. This is not something that smp_rmb() does, nor is it something
> >that smp_wmb() does. Yes, it might happen to now, but this could easily
> >break in the future -- plus this change is extremely misleading.
> >
> >The original __lwsync() is much more clear.
>
> The problem I had with __lwsync is that it really wasn't all that
> clear. It was the lwsync instruction if SMP was enabled, otherwise
> it was just a barrier call. What I did is move the definition of
> __lwsync in the SMP case into fast_rmb, which in turn is accessed by
> smp_rmb. I tried to make this clear in the comment just above the
> two calls. The resultant assembly code should be exactly the same.
>
> What I could do is have it added back as a smp_lwsync if that works
> for you. That way there is something there to give you a hint that
> it becomes a barrier() call as soon as SMP is disabled.

An smp_lwsync() would be a great improvement!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/