Re: [PATCH] percpu-rwsem: use barrier in unlock path

From: Mikulas Patocka
Date: Thu Oct 18 2012 - 12:55:10 EST




On Tue, 16 Oct 2012, Linus Torvalds wrote:

> [ Architecture people, note the potential new SMP barrier! ]
>
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > + /*
> > + * The lock is considered unlocked when p->locked is set to false.
> > + * Use barrier to prevent reordering of operations around p->locked.
> > + */
> > +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> > + barrier();
> > +#else
> > + smp_mb();
> > +#endif
> > p->locked = false;
>
> Ugh. The #if is too ugly to live.

I suggest that you drop this patch and use the synchronize_rcu() approach
that I have just posted instead.

In any case, another instance of this "#if defined(CONFIG_X86) && ..."
construct is already present in the code, in percpu_up_read.


What is the procedure for making changes that require support from all
architectures? It is trivial to make a patch that moves this into
arch-specific includes; the problem is that such a patch breaks all the
other architectures - I wrote support for x86, sparc, parisc and alpha (I
can test those), but not for the rest.

Are you going to apply this patch, break the majority of architectures,
and wait until the architecture maintainers fix it? Or is there some
arch-specific git tree where the patch should be applied and where the
maintainers fix things up?

> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.
>
> I suspect we could make the above x86-specific optimization be valid
> by introducing a new barrier, called "smp_release_before_store()" or
> something like that, which on x86 just happens to be a no-op (but
> still a compiler barrier). That's because earlier reads will not pass
> a later stores, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).
>
> But it is possible that that is true on other architectures too, so
> your hand-crafted x86-specific #if is not just ugly, it's liable to
> get worse.
>
> The alternative is to just say that we should use "smp_mb()"
> unconditionally, depending on how critical this code actually ends up
> being.
>
> Added linux-arch in case architecture-maintainers have comments on
> "smp_release_before_store()" thing. It would be kind of similar to the
> odd "smp_read_barrier_depends()", except it would normally be a full
> memory barrier, except on architectures where a weaker barrier might
> be sufficient.
>
> I suspect there may be many architectures where a "smp_wmb()" is
> sufficient for this case, for the simple reason that no sane
> microarchitecture would *ever* move earlier reads down past a later
> write,

Alpha can move reads down past writes: if the read misses the cache and the
write hits it, it may make sense for the CPU to do this reordering.
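
As an illustration (a sketch only - shared_data is a made-up name, this is
not code from the patch): suppose the reader's critical section ends with

	v = shared_data;		/* load inside the critical section */
	this_cpu_dec(*p->counters);	/* the store that releases the read lock */

On Alpha the load may be satisfied after the decrement becomes visible, so a
writer that sees the counter drop to zero can modify shared_data while the
reader is still using it. That is why the unlock path needs a real barrier
between the two operations on such architectures.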

> so release consistency really only needs the local writes to be
> ordered, not the full memory ordering.
>
> Arch people?
>
> The more optimal solution may be to mark the store *itself* to be
> "store with release consistency", which on x86 would be a regular
> store (with the compiler barrier), but on other architectures may be a
> special memory operation. On architectures with
> release/acquire-consistency, there's not a separate barrier before the
> store, the store instruction itself is done with special semantics. So
> maybe the right thing to do is
>
> #define smp_release_consistency_store(val, ptr) ...
>
> where on x86, the implementation would be a simple
>
> do { barrier(); *(ptr)=(val); } while (0)

smp_release_consistency_store() doesn't work here. In
include/linux/percpu-rwsem.h it is required that
"this_cpu_dec(*p->counters);" works as an unlock barrier - and
this_cpu_dec() is a read-modify-write operation, not a plain store that
the proposed macro could wrap. So we could do one of the following:

"smp_mb(); this_cpu_dec(*p->counters);" - generates a pointless barrier
on x86.

"#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
barrier();
#else
smp_mb();
#endif
this_cpu_dec(*p->counters);" - removes the barrier on the most common
architecture (x86), but the code looks dirty.

"smp_release_before_store(); this_cpu_dec(*p->counters);" - the code is
clean, but the downside is that it breaks architectures that don't
define smp_release_before_store().
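
To illustrate the difference (a sketch only - smp_release_consistency_store()
exists in no tree; the x86 expansion is taken from Linus's suggestion above):

	/* hypothetical store-with-release; fine for percpu_up_write,
	   which releases the lock with a single plain store: */
	#define smp_release_consistency_store(val, ptr) \
		do { barrier(); *(ptr) = (val); } while (0)

	smp_release_consistency_store(false, &p->locked);

	/* but percpu_up_read releases with a decrement, which the macro
	   cannot express - hence the standalone barrier form: */
	smp_release_before_store();
	this_cpu_dec(*p->counters);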

> but on other architectures it might be a inline asm with the required
> magic store-with-release instruction.
>
> How important is this code sequence? Is the "percpu_up_write()"
> function really so critical that we can't have an extra memory
> barrier?

percpu_up_write() is not performance-critical. But percpu_up_read() is, and
it should be as fast as possible.
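
For context, the read side currently looks roughly like this (simplified
from include/linux/percpu-rwsem.h):

	static inline void percpu_down_read(struct percpu_rw_semaphore *p)
	{
		rcu_read_lock();
		if (unlikely(p->locked)) {	/* slow path: writer active */
			rcu_read_unlock();
			mutex_lock(&p->mtx);
			this_cpu_inc(*p->counters);
			mutex_unlock(&p->mtx);
			return;
		}
		this_cpu_inc(*p->counters);	/* fast path: per-CPU inc only */
		rcu_read_unlock();
	}

Every reader pays for whatever barrier percpu_up_read() contains, once per
critical section, which is why an unconditional smp_mb() there is unwelcome.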

Mikulas

> Or do people perhaps see *other* places where
> release-consistency-stores might be worth doing?
>
> But in no event do I want to see that butt-ugly #if statement.
>
> Linus
>

---
Introduce smp_release_before_store()

smp_release_before_store(), together with the store operation that follows
it, works as an unlock barrier - that is, memory accesses may be moved
before the unlock barrier, but no memory accesses are moved past it.

On x86 writes are strongly ordered (unless CONFIG_X86_PPRO_FENCE or
CONFIG_X86_OOSTORE is selected), so smp_release_before_store() defaults
to a compiler barrier (and no barrier instruction). On other
architectures it may need more heavyweight barriers.

Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>

---
Documentation/memory-barriers.txt | 5 +++++
arch/alpha/include/asm/barrier.h | 2 ++
arch/parisc/include/asm/barrier.h | 1 +
arch/sparc/include/asm/barrier_32.h | 1 +
arch/sparc/include/asm/barrier_64.h | 2 ++
arch/x86/include/asm/barrier.h | 6 ++++++
include/linux/percpu-rwsem.h | 13 +------------
7 files changed, 18 insertions(+), 12 deletions(-)

Index: linux-3.6.2-fast/arch/alpha/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/alpha/include/asm/barrier.h 2012-10-18 17:45:12.000000000 +0200
+++ linux-3.6.2-fast/arch/alpha/include/asm/barrier.h 2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,8 @@ __asm__ __volatile__("mb": : :"memory")
#define smp_read_barrier_depends() do { } while (0)
#endif

+#define smp_release_before_store() smp_mb()
+
#define set_mb(var, value) \
do { var = value; mb(); } while (0)

Index: linux-3.6.2-fast/arch/parisc/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/parisc/include/asm/barrier.h 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/parisc/include/asm/barrier.h 2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,7 @@
#define smp_wmb() mb()
#define smp_read_barrier_depends() do { } while(0)
#define read_barrier_depends() do { } while(0)
+#define smp_release_before_store() mb()

#define set_mb(var, value) do { var = value; mb(); } while (0)

Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_32.h 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h 2012-10-18 17:56:48.000000000 +0200
@@ -11,5 +11,6 @@
#define smp_rmb() __asm__ __volatile__("":::"memory")
#define smp_wmb() __asm__ __volatile__("":::"memory")
#define smp_read_barrier_depends() do { } while(0)
+#define smp_release_before_store() smp_mb()

#endif /* !(__SPARC_BARRIER_H) */
Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_64.h 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h 2012-10-18 17:46:24.000000000 +0200
@@ -53,4 +53,6 @@ do { __asm__ __volatile__("ba,pt %%xcc,

#define smp_read_barrier_depends() do { } while(0)

+#define smp_release_before_store() __asm__ __volatile__("":::"memory")
+
#endif /* !(__SPARC64_BARRIER_H) */
Index: linux-3.6.2-fast/arch/x86/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/x86/include/asm/barrier.h 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/x86/include/asm/barrier.h 2012-10-18 17:46:24.000000000 +0200
@@ -100,6 +100,12 @@
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif

+#ifdef CONFIG_X86_PPRO_FENCE
+#define smp_release_before_store() smp_mb()
+#else
+#define smp_release_before_store() smp_wmb()
+#endif
+
/*
* Stop RDTSC speculation. This is needed when you need to use RDTSC
* (or get_cycles or vread that possibly accesses the TSC) in a defined
Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h 2012-10-18 17:46:24.000000000 +0200
@@ -28,18 +28,7 @@ static inline void percpu_down_read(stru

static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
- /*
- * On X86, write operation in this_cpu_dec serves as a memory unlock
- * barrier (i.e. memory accesses may be moved before the write, but
- * no memory accesses are moved past the write).
- * On other architectures this may not be the case, so we need smp_mb()
- * there.
- */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
- barrier();
-#else
- smp_mb();
-#endif
+ smp_release_before_store();
this_cpu_dec(*p->counters);
}

Index: linux-3.6.2-fast/Documentation/memory-barriers.txt
===================================================================
--- linux-3.6.2-fast.orig/Documentation/memory-barriers.txt 2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/Documentation/memory-barriers.txt 2012-10-18 17:46:24.000000000 +0200
@@ -1056,11 +1056,16 @@ The Linux kernel has eight basic CPU mem
WRITE            wmb()                    smp_wmb()
READ             rmb()                    smp_rmb()
DATA DEPENDENCY  read_barrier_depends()   smp_read_barrier_depends()
+ UNLOCK BARRIER  -                        smp_release_before_store()


All memory barriers except the data dependency barriers imply a compiler
barrier. Data dependencies do not impose any additional compiler ordering.

+smp_release_before_store(), together with the store operation that follows
+it, implies an unlock barrier. That is, memory accesses may be moved before
+the unlock barrier, but no memory accesses are moved past it.
+
Aside: In the case of data dependencies, the compiler would be expected to
issue the loads in the correct order (eg. `a[b]` would have to load the value
of b before loading a[b]), however there is no guarantee in the C specification
--