Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)

From: Nick Piggin
Date: Tue Jun 20 2006 - 09:48:21 EST

Next message: Vasily Averin: "Re: [PATCH 1/1] scsi : megaraid_{mm,mbox}: a fix on 64-bit DMA capability check"
Previous message: Edgar Hucek: "[PATCH 1/1] New Framebuffer for Intel based Macs"
In reply to: Arjan van de Ven: "Re: update pci device id"
Next in thread: Arjan van de Ven: "Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)"

Arjan van de Ven wrote:

> > Correct me if I'm wrong, but... a read-lock requires at most a single
> > cacheline transfer per lock acq and a single per release, no matter the
> > concurrency on the lock (so long as it is read only).
> >
> > A spinlock is going to take more. If the hardware perfectly round-robins
> > the cacheline, it will take lockers+1 transfers per lock+unlock.

> This is a bit too simplistic view; shared cachelines are cheap, it's
> getting the cacheline exclusive (or transitioning to/from exclusive)
> that is the expensive part...

Taking the lock is going to transition the cacheline to exclusive. If
the next locker tries to take the lock, they transfer the cacheline with
exclusive access and fail. If they have already tried to take the lock
earlier, they might only request a readonly state, but it still requires
a cacheline transfer (which is the expensive part).

The only way it is simplistic is that hardware will be unfair and give
the same, or "close" requesters priority for some time, so the cacheline
stays close.

At some point, when it gets transferred away, there is no guarantee that
the spinlock will be unlocked. Quite likely the opposite, if there is
large contention for it and/or its cacheline.

> (note that our spinlocks are fixed nowadays to only do the slowpath side
> of things for read, e.g. allow shared cachelines there)
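
The note above describes what is usually called a test-and-test-and-set
lock. A minimal sketch in C11 atomics, assuming that reading of the note;
this is not the kernel's actual implementation, and the ttas_* names are
made up for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; 0 = unlocked, 1 = locked. */
struct ttas_spinlock {
	atomic_int locked;
};

/* Fast path: one atomic exchange, which needs the cacheline exclusive. */
static bool ttas_spin_trylock(struct ttas_spinlock *l)
{
	return atomic_exchange_explicit(&l->locked, 1,
					memory_order_acquire) == 0;
}

static void ttas_spin_lock(struct ttas_spinlock *l)
{
	while (!ttas_spin_trylock(l)) {
		/*
		 * Slow path: spin with plain loads only, so waiters can
		 * hold the cacheline in shared state until it looks free.
		 */
		while (atomic_load_explicit(&l->locked,
					    memory_order_relaxed))
			;
	}
}

static void ttas_spin_unlock(struct ttas_spinlock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Even with the read-only slow path, each waiter still has to fetch the
cacheline again after the holder writes to it.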

To put it another way, when 1 CPU takes or releases the lock, the cachelines
of 11 others are invalidated. In a perfect round-robin, if 12 queue up at the
same time, 1 will go through and 11 will fail (= 12 cacheline transfers). So
in this situation, the reader lock has a factor of 12 better acquisition
throughput.

Now the situation is simplistic (all queueing at the same time, perfectly
fair hardware), but the cacheline transfer costs are accurate *for this
situation*.
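
The arithmetic for this situation can be sketched in C. It is only the
idealized model described above (everyone queues at once, perfect
round-robin, one cacheline transfer per lock attempt), and the function
names are hypothetical:

```c
#include <assert.h>

/*
 * Idealized model only: all CPUs queue at once, the hardware hands the
 * cacheline around in perfect round-robin, and every lock attempt costs
 * exactly one cacheline transfer. Function names are hypothetical.
 */

/* Spinlock: all ncpus contenders pull the line, but only one attempt wins. */
static unsigned int spin_transfers_per_acquire(unsigned int ncpus)
{
	assert(ncpus > 0);
	return ncpus;
}

/* Read lock: each reader's single transfer is a successful acquisition. */
static unsigned int read_transfers_per_acquire(unsigned int ncpus)
{
	assert(ncpus > 0);
	return 1;
}

/* Ratio of read-lock to spinlock acquisition throughput in this model. */
static unsigned int read_lock_advantage(unsigned int ncpus)
{
	return spin_transfers_per_acquire(ncpus) /
	       read_transfers_per_acquire(ncpus);
}
```

With ncpus = 12, read_lock_advantage() returns 12, which is the factor
quoted above.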

So I think rwlocks do have a fundamental advantage over spinlocks (aside
from the multiple concurrent readers advantage, although the two properties
are obviously fundamentally related). It is yet to be shown whether that is
actually the cause of Peter's performance improvement, but that is my
guess.

--
SUSE Labs, Novell Inc.