[PATCH v4 0/5] x86: faster smp_mb() + documentation tweaks

From: Michael S. Tsirkin
Date: Wed Jan 27 2016 - 10:10:24 EST

mb() typically uses mfence on modern x86, but a micro-benchmark shows that it's
2 to 3 times slower than the lock; addl that we use on older CPUs.

So we really should use the locked variant everywhere, except that the Intel
manual says that clflush is only ordered by mfence, so we can't.
Note: some callers of clflush seem to assume sfence will
order it, so there could be existing bugs around this code.

Fortunately no callers of clflush (except one) order it using smp_mb(), so
after fixing that one caller, it seems safe to override smp_mb straight away.

Down the road, it might make sense to introduce clflush_mb() and switch
to that for clflush callers.

While I was at it, I found some inconsistencies in the comments in
arch/x86/include/asm/barrier.h.
The documentation fixes are included first - I verified that
they do not change the generated code at all. Borislav Petkov
said they will appear in tip eventually; they are included here for
completeness.

The last patch changes __smp_mb() to lock addl. I was unable to
measure a speed difference on a macro benchmark,
but I noted that even doing
#define mb() barrier()
seems to make no difference for most benchmarks
(it causes hangs sometimes, of course).

HPA asked that the last patch be deferred until we hear back from
Intel, which of course makes sense. So it needs HPA's ack.

Changes from v3:
Leave mb() alone for now since it's used to order
clflush, which requires mfence. Optimize smp_mb instead.

Changes from v2:
add patch adding cc clobber for addl
tweak commit log for patch 2
use addl at SP-4 (as opposed to SP) to reduce data dependencies

Michael S. Tsirkin (5):
x86: add cc clobber for addl
x86: drop a comment left over from X86_OOSTORE
x86: tweak the comment about use of wmb for IO
x86: use mb() around clflush
x86: drop mfence in favor of lock+addl

arch/x86/include/asm/barrier.h | 17 ++++++++---------
arch/x86/kernel/process.c | 4 ++--
2 files changed, 10 insertions(+), 11 deletions(-)