Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

From: Waiman Long
Date: Mon Oct 19 2015 - 13:24:34 EST

Next message: Alexander Holler: "Re: [PATCH 03/14] init: deps: dt: use (HW-specific) dependencies provided by the DT too"
Previous message: William Dauchy: "Re: [PATCH] fs: fix data races on inode->i_flctx"
In reply to: Ingo Molnar: "Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback"
Next in thread: Ling Ma: "Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/19/2015 07:24 AM, Ingo Molnar wrote:

* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

On Mon, Oct 19, 2015 at 09:58:23AM +0200, Ingo Molnar wrote:

* ling.ma.program@xxxxxxxxx <ling.ma.program@xxxxxxxxx> wrote:

From: Ma Ling <ling.ml@xxxxxxxxxxxxxxx>

All load instructions can execute speculatively, but across multiple
cores they still have to follow the memory ordering rule shown below:
_x = _y = 0

Processor 0                     Processor 1

mov r1, [ _y]   //M1            mov [ _x], 1    //M3
mov r2, [ _x]   //M2            mov [ _y], 1    //M4

If r1 = 1, r2 must be 1
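
The rule above can be written down as a small C11 sketch. This is only an illustration, not code from the patch: the names x, y, writer, reader and outcome_allowed are made up here, and acquire/release ordering is assumed as the portable stand-in for the ordering that plain x86 mov instructions already provide.

```c
#include <assert.h>
#include <stdatomic.h>

/* _x and _y from the example; static storage starts them at 0. */
static atomic_int x, y;

/* The only outcome the ordering rule forbids is r1 == 1 with r2 == 0. */
static int outcome_allowed(int r1, int r2)
{
    return !(r1 == 1 && r2 == 0);
}

/* Processor 1: M3 then M4. Release stores keep the two writes in order. */
static void writer(void)
{
    atomic_store_explicit(&x, 1, memory_order_release); /* M3 */
    atomic_store_explicit(&y, 1, memory_order_release); /* M4 */
}

/* Processor 0: M1 then M2. Acquire loads keep the two reads in order. */
static void reader(int *r1, int *r2)
{
    *r1 = atomic_load_explicit(&y, memory_order_acquire); /* M1 */
    *r2 = atomic_load_explicit(&x, memory_order_acquire); /* M2 */
    assert(outcome_allowed(*r1, *r2));
}
```

If reader() and writer() run on different threads, the assertion in reader() is expected to hold for every interleaving, which is the guarantee the hardware has to preserve even when it executes M2 ahead of M1.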

To guarantee the rule above, even when Processor 0 executes the
M1 and M2 instructions out of order, both are kept in the ROB.
When the load buffer entry for _x in Processor 0 receives the
update message from Processor 1, Processor 0 needs to roll back
from the M2 instruction, which flushes the whole pipeline;
the latency exceeds the penalty of a branch misprediction.

In this patch we use the lock cmpxchg instruction to force the
load instructions to be serialized. The destination operand
receives a write cycle regardless of the result of the
comparison, which helps us reduce the penalty from load
instruction rollback.
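
As a rough userspace illustration of the idea (not the code from the patch), a lock word can be read through a compare-and-swap instead of a plain load. The helper names load_via_cmpxchg and wait_unlocked are hypothetical, and the sketch assumes a compiler that turns the C11 strong compare-exchange into lock cmpxchg on x86.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the current value of *word using a
 * compare-and-swap rather than a plain load.
 *
 * If *word is 0 the exchange succeeds and stores 0 back, so the value
 * does not change. Otherwise it fails and copies the current value
 * into 'seen'. Either way 'seen' ends up holding the value of *word.
 */
static inline uint32_t load_via_cmpxchg(_Atomic uint32_t *word)
{
    uint32_t seen = 0;

    atomic_compare_exchange_strong(word, &seen, 0);
    return seen;
}

/* Spin until the (hypothetical) lock word reads 0, meaning unlocked. */
static inline void wait_unlocked(_Atomic uint32_t *lock_word)
{
    while (load_via_cmpxchg(lock_word) != 0)
        ;
}
```

The trade-off is that every poll now performs a locked read-modify-write on the cache line, so whether this helps or hurts depends on the contention level and the hardware, which is why measurements on several machines matter.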

Our experiments indicate that performance improves by 10%~15%
in the 2- and 3-thread cases, where conflicts on the lock cache
line take up most of the time.

So it would be nice to create a new user-space spinlock testing facility, via a
new 'perf bench spinlock' feature or so. That way others can test and validate
your results on different hardware as well.

So it's trivial to lift this code into userspace -- in fact, I have that
somewhere.

The trouble is going to be keeping them in sync.

So we can just try this optimistically, and if it keeps breaking, we can use the
technique perf uses to sync up the rbtree implementation: we copy the kernel
version into tooling, but run diff against the kernel version and warn at tool
build time that there's divergence.

I.e. a non-build-fatal force that keeps things in sync.
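
A minimal sketch of such a non-fatal check, written in C only to keep it self-contained (the thread describes running diff at tool build time). The function names and any file paths passed in are placeholders, not real build targets.

```c
#include <stdio.h>

/* Return 1 if the files differ or cannot be opened, 0 if identical. */
static int files_differ(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int differ = 1;

    if (fa && fb) {
        int ca, cb;

        /* Walk both files in step until a mismatch or end of file. */
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
        } while (ca == cb && ca != EOF);
        differ = (ca != cb);
    }
    if (fa)
        fclose(fa);
    if (fb)
        fclose(fb);
    return differ;
}

/* Warn about divergence, but always return 0 so the build continues. */
static int check_sync(const char *kernel_copy, const char *tools_copy)
{
    if (files_differ(kernel_copy, tools_copy))
        fprintf(stderr, "Warning: %s differs from %s\n",
                tools_copy, kernel_copy);
    return 0;
}
```

Returning 0 regardless of the result is the point: the warning nags whoever builds the tool, without blocking kernel-side changes to the lock code.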

Thanks,

Ingo

It is on my to-do list. I just want to wrap up my latest PV qspinlock patch before embarking on this adventure.

Cheers,
Longman
