2015-10-20 1:18 GMT+08:00 Waiman Long<waiman.long@xxxxxxx>:
On 10/18/2015 10:27 PM, ling.ma.program@xxxxxxxxx wrote:Waiman,
From: Ma Ling<ling.ml@xxxxxxxxxxxxxxx>
All load instructions can run speculatively but they have to follow
memory order rule in multiple cores as below:
_x = _y = 0
Processor 0 Processor 1
mov r1, [ _y] //M1 mov [ _x], 1 //M3
mov r2, [ _x] //M2 mov [ _y], 1 //M4
If r1 = 1, r2 must be 1
In order to guarantee above rule, although Processor 0 execute
M1 and M2 instruction out of order, they are kept in ROB,
when load buffer for _x in Processor 0 received the update
message from Processor 1, Processor 0 need to roll back
from M2 instruction, which will flush the whole pipeline,
the latency is over the penalty from branch prediction miss.
In this patch we use lock cmpxchg instruction to force load
instructions to be serialization, the destination operand
receives a write cycle without regard to the result of
the comparison, which can help us to reduce the penalty
from load instruction roll back.
Our experiment indicates the performance can be improved by 10%~15%
for 2 and 3 threads cases, the conflicts from lock cache line
spend them most of the time.
What kind of performance test were you running? With the right timing, it is
possible that you see some performance gain. However, if the lock hold time
is longer so that a fair number of cmpxchg instructions have to be executed
before it can get the lock, you may see a performance degradation especially
if the lock holder needs to access the lock cacheline.
In general, we try to avoid this kind of cmpxchg loop unless we are sure
that at most a few iterations of the loop may happen.
The machine is Haswell (2699 V3, COD off, HT on, 2 sockets)
(we have sent test module in separate email)
A. Data is located with lock in one cache line On 2 threads cases
(only write struct member data_a)
1. Load version test 5 times, the cost time is below:
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 103904620
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 104351876
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 118599784
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 103064024
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 103389696
Totally cost time is 533310000
2. Lock cmpxchg version test 5 times, the cost time is below:
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 67081220
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 97640708
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 96439612
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 66699296
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 96464800
Totally cost time is 424325636
Above data shows lock cmpxchg is better about average 25% (533310000/424325636)
B. Data is located with lock in different cache line On 2 threads
cases(only write struct member data_b)
1. Load version test 5 times, the cost time is below:
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 174266128
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 205053924
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 160165124
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 173241552
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 205765008
Totally cost time is 918491736
2. Lock cmpxchg version test 5 times, the cost time is below:
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 113410044
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 116293104
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 116064256
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 189320876
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy;dmesg -c
all cost time is 123735352
Totally cost time is 658823632
Above data shows lock cmpxchg is better about average 39% (918491736/658823632)