Re: [PATCH v3 1/2] qspinlock: Introducing a 4-byte queue spinlockimplementation

From: Waiman Long
Date: Fri Jan 31 2014 - 14:24:47 EST


On 01/31/2014 10:08 AM, Peter Zijlstra wrote:
On Tue, Jan 28, 2014 at 01:19:10PM -0500, Waiman Long wrote:
For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

Lock Type Time (ns)
--------- ---------
Ticket spinlock 14.1
Queue spinlock (Normal) 8.8*
What CONFIG_NR_CPUS ?

I was testing on a RHEL6.4 system which has a CONFIG_NR_CPUS of 4096.


Because for CONFIG_NR_CPUS< 128 (or 256 if you got !PARAVIRT), the fast
path code should be:

ticket:

mov $0x100,eax
lock xadd %ax,(%rbx)
cmp %al,%ah
jne ...

although my GCC is being silly and writes:

mov $0x100,eax
lock xadd %ax,(%rbx)
movzbl %ah,%edx
cmp %al,%dl
jne ...

Which seems rather like a waste of a perfectly good cycle.

With a bigger NR_CPUS you do indeed need more ops:

mov $0x10000,%edx
lock xadd %edx,(%rbx)
mov %edx,%ecx
shr $0x10,%ecx
cmp %dx,%cx
jne ...


Whereas for the straight cmpxchg() you'd get something relatively simple
like:

mov %edx,%eax
lock cmpxchg %ecx,(%rbx)
cmp %edx,%eax
jne ...

I believe the speeds of the lock functions are about the same. However, qspinlock has a much simpler unlock function which probably account of most of the speed gain.

Anyway, as soon as you get some (light) contention you're going to tank
because you have to pull in extra cachelines, which is sad.

Light contention is the only case where the qspinlock may not perform as good as the ticket spinlock. I know this is the most common case. However, I would argue that the slowdown, if any, will not be really noticeable. This is what I will try to find out.


I suppose we could from the ticket code more and optimize the
uncontended path, but that'll make the contended path more expensive
again, although probably not as bad as hitting a new cacheline.

I don't get what you are trying to say.

Right now, I am using only bit 0 as a lock bit. I can use bit 4, for instance, as a pending locker bit and spin until bit 0 is clear. So if there is only 1 other task spinning, it won't need to fetch another cacheline. However, it will slow down the uncontended path as I can't assign a 0 byte to free the lock. I have to use an atomic subtraction or clear bit instead.

-Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/