Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

From: Waiman Long
Date: Tue Oct 20 2015 - 14:55:42 EST


On 10/19/2015 11:12 PM, Ling Ma wrote:
2015-10-20 1:18 GMT+08:00 Waiman Long<waiman.long@xxxxxxx>:
On 10/18/2015 10:27 PM, ling.ma.program@xxxxxxxxx wrote:
From: Ma Ling<ling.ml@xxxxxxxxxxxxxxx>

All load instructions can run speculatively but they have to follow
memory order rule in multiple cores as below:
_x = _y = 0

Processor 0 Processor 1

mov r1, [ _y] //M1 mov [ _x], 1 //M3
mov r2, [ _x] //M2 mov [ _y], 1 //M4

If r1 = 1, r2 must be 1

In order to guarantee above rule, although Processor 0 execute
M1 and M2 instruction out of order, they are kept in ROB,
when load buffer for _x in Processor 0 received the update
message from Processor 1, Processor 0 need to roll back
from M2 instruction, which will flush the whole pipeline,
the latency is over the penalty from branch prediction miss.

In this patch we use lock cmpxchg instruction to force load
instructions to be serialization, the destination operand
receives a write cycle without regard to the result of
the comparison, which can help us to reduce the penalty
from load instruction roll back.

Our experiment indicates the performance can be improved by 10%~15%
for 2 and 3 threads cases, the conflicts from lock cache line
spend them most of the time.

What kind of performance test were you running? With the right timing, it is
possible that you see some performance gain. However, if the lock hold time
is longer so that a fair number of cmpxchg instructions have to be executed
before it can get the lock, you may see a performance degradation especially
if the lock holder needs to access the lock cacheline.

In general, we try to avoid this kind of cmpxchg loop unless we are sure
that at most a few iterations of the loop may happen.
Waiman,

The machine is Haswell (2699 V3, COD off, HT on, 2 sockets)
(we have sent test module in separate email)



A. Data is located with lock in one cache line On 2 threads cases
(only write struct member data_a)

1. Load version test 5 times, the cost time is below:


[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 103904620

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 104351876

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 118599784

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 103064024

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 103389696

Totally cost time is 533310000

2. Lock cmpxchg version test 5 times, the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 67081220

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 97640708

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 96439612

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 66699296

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 96464800



Totally cost time is 424325636



Above data shows lock cmpxchg is better about average 25% (533310000/424325636)



B. Data is located with lock in different cache line On 2 threads
cases(only write struct member data_b)



1. Load version test 5 times, the cost time is below:



[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 174266128

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 205053924

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 160165124

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 173241552

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 205765008

Totally cost time is 918491736



2. Lock cmpxchg version test 5 times, the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 113410044

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 116293104

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 116064256

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 189320876

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c



all cost time is 123735352

Totally cost time is 658823632



Above data shows lock cmpxchg is better about average 39% (918491736/658823632)



I did see some performance improvement when I used your test program on a Haswell-EX system. It seems like the use of cmpxchg has forced the changed memory values to be visible to other processors earlier. I also ran your test on an older machine with Westmere-EX processors. This time, I didn't see any performance improvement. In fact, your change actually make it a tiny bit slower. So the benefit of your patch can be highly processor sensitive.

As other architectures like ARM & AA64 are going to adopt qspinlock in the near future, we will also need to make sure that it won't cause a regression there. So I don't see your patch has a big chance of being merged upstream unless you can provide a real world workload that can benefit from your patch. Even then, proving that it won't cause regression in other processors or architectures can be tedious.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/