[PATCH v4 5/5] x86: drop mfence in favor of lock+addl

From: Michael S. Tsirkin
Date: Wed Jan 27 2016 - 10:10:53 EST

mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, as we always did on old 32-bit.

Just poking at SP would be the most natural, but if we
then read the value from SP, we get a false dependency
which will slow us down.

This was noted in this article:

And is easy to reproduce by sticking a barrier in a small non-inline
function.

So let's use a negative offset - which avoids this problem since we
build with the red zone disabled.
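
A minimal userspace sketch of the two placements discussed above, for
experimenting with the effect. All names here (SKETCH_MB_AT_SP,
SKETCH_MB_BELOW_SP, sketch_run and friends) are hypothetical and not part
of this patch. Note that userspace x86-64 code is normally built with the
red zone enabled, so -4(%rsp) is not guaranteed unused there as it is in
the kernel; adding zero leaves whatever is stored there unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper names for illustration; not part of the patch. */
#if defined(__x86_64__)
/* Locked add of zero to the word at the stack pointer itself. */
#define SKETCH_MB_AT_SP() \
	__asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
/* Same barrier, aimed 4 bytes below the stack pointer. */
#define SKETCH_MB_BELOW_SP() \
	__asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#elif defined(__i386__)
#define SKETCH_MB_AT_SP() \
	__asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#define SKETCH_MB_BELOW_SP() \
	__asm__ __volatile__("lock; addl $0,-4(%%esp)" ::: "memory", "cc")
#else
/* Portable fallback so the sketch still builds on other architectures. */
#define SKETCH_MB_AT_SP()    __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define SKETCH_MB_BELOW_SP() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/*
 * Small non-inline functions: the return (and any pop) reads from the
 * stack right after the barrier, which is where a barrier aimed at
 * 0(%sp) may pick up the false dependency.
 */
__attribute__((noinline)) uint64_t sketch_barrier_at_sp(uint64_t x)
{
	SKETCH_MB_AT_SP();
	return x + 1;
}

__attribute__((noinline)) uint64_t sketch_barrier_below_sp(uint64_t x)
{
	SKETCH_MB_BELOW_SP();
	return x + 1;
}

/* Call fn() iters times in a dependent chain; wrap this in a timer to compare. */
uint64_t sketch_run(uint64_t (*fn)(uint64_t), uint64_t iters)
{
	uint64_t v = 0;

	assert(fn != NULL);
	for (uint64_t i = 0; i < iters; i++)
		v = fn(v);
	return v;
}
```

Timing sketch_run() with each function over a large iteration count is
one way to compare the two placements on a given CPU; results will vary
by microarchitecture.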

Unfortunately there's some code that wants to order clflush instructions
using mb(), so we can't replace that - but smp_mb should be safe
to replace.
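
To illustrate the split described above, here is a userspace sketch that
keeps mfence for the clflush path and uses the locked add only for
CPU-to-CPU ordering. The sketch_* names are hypothetical, the x86-64
branch assumes clflush is available, and this is not the kernel's actual
implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names for illustration; not part of the patch. */
#if defined(__x86_64__)
/* Mandatory barrier: stays mfence, as the patch keeps it for mb(). */
#define sketch_mb() \
	__asm__ __volatile__("mfence" ::: "memory")
/* SMP-only barrier: locked add of zero just below the stack pointer. */
#define sketch_smp_mb() \
	__asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
/* Flush the cache line containing *p. */
#define sketch_clflush(p) \
	__asm__ __volatile__("clflush %0" : "+m"(*(volatile char *)(p)))
#else
/* Portable fallback so the sketch still builds on other architectures. */
#define sketch_mb()       __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_smp_mb()   __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_clflush(p) ((void)(p))
#endif

/*
 * Store-then-load ordering between CPUs: set our flag, then look at the
 * other side's flag. The SMP barrier is enough for this case.
 */
int sketch_announce_and_check(int *mine, const int *theirs)
{
	*mine = 1;
	sketch_smp_mb();
	return *theirs;
}

/*
 * Cache-line flush followed by the mandatory barrier: this is the kind
 * of caller the commit message says still wants mb().
 */
void sketch_flush(void *line)
{
	assert(line != NULL);
	sketch_clflush(line);
	sketch_mb();
}
```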

Update mb/rmb/wmb on 32 bit to use the negative offset, too, for
consistency.

Suggested-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
---
arch/x86/include/asm/barrier.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index bfb28ca..7ab9581 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -11,11 +11,11 @@

#ifdef CONFIG_X86_32
-#define mb() asm volatile(ALTERNATIVE("lock; addl $0,0(%%esp)", "mfence", \
+#define mb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "mfence", \
X86_FEATURE_XMM2) ::: "memory", "cc")
-#define rmb() asm volatile(ALTERNATIVE("lock; addl $0,0(%%esp)", "lfence", \
+#define rmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "lfence", \
X86_FEATURE_XMM2) ::: "memory", "cc")
-#define wmb() asm volatile(ALTERNATIVE("lock; addl $0,0(%%esp)", "sfence", \
+#define wmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "sfence", \
X86_FEATURE_XMM2) ::: "memory", "cc")
#else
#define mb() asm volatile("mfence":::"memory")
@@ -30,7 +30,7 @@
#define dma_wmb() barrier()

-#define __smp_mb() mb()
+#define __smp_mb() asm volatile("lock; addl $0,-4(%%esp)" ::: "memory", "cc")
#define __smp_rmb() dma_rmb()
#define __smp_wmb() barrier()
#define __smp_store_mb(var, value) do { (void)xchg(&var, value); } while (0)