Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)

From: Nick Piggin
Date: Tue Jun 20 2006 - 06:58:02 EST


[Corrected Arjan's address, which I messed up earlier.]

Ingo Molnar wrote:
* Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:


Ingo Molnar wrote:

curious, do you have any (relatively) simple-to-run testcase that clearly shows the "scalability issues" you mention above, when going from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.

Arjan van de Ven wrote:

I'm curious what scalability advantage you see for rw spinlocks vs real
spinlocks ... since for any kind of moderate hold time the opposite is
expected ;)

It actually surprised me too, but Peter Chubb (who IIRC provided the motivation to merge the patch) showed some fairly significant improvement at 12-way.

https://www.gelato.unsw.edu.au/archives/scalability/2005-March/000069.html


I think that workload wasn't analyzed well enough (by us, not by Peter, who sent a reasonable analysis and suggested a reasonable change), and we went with whatever magic change appeared to make a difference, without fully understanding the underlying reasons. Quote:

"I'm not sure what's happening in the 4-processor case."

Now history appears to be repeating itself, just in the other direction ;) And we didn't get one inch closer to understanding the situation for real. I'd vote for putting a change-moratorium on tree_lock and only allowing a tweak that comes with a full analysis of the workload :-)

One thing off the top of my head: doesn't lockstat introduce significant overhead? Is this reproducible with lockstat turned off too? Is the same scalability problem visible if all read_lock()s are changed to write_lock()s? [like I did in my patch] I.e. can other explanations (like unlucky alignment of certain rwlock data structures / functions) be excluded?

Yes, it would need re-testing.
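
For concreteness, the conversion you describe amounts to something like this
on the page-cache lookup fast path (simplified sketch from memory of the code
at the time, not a verbatim copy):

struct page *find_get_page(struct address_space *mapping, unsigned long offset)
{
        struct page *page;

        write_lock_irq(&mapping->tree_lock);            /* was: read_lock_irq() */
        page = radix_tree_lookup(&mapping->page_tree, offset);
        if (page)
                page_cache_get(page);
        write_unlock_irq(&mapping->tree_lock);          /* was: read_unlock_irq() */
        return page;
}

That keeps the hold times the same and only changes the lock type, so it
should isolate the read- vs write-acquire behaviour from everything else.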


Another thing: average hold times in the spinlock case on that workload are below 1 microsecond - probably in the range of cache-miss bounce costs on such a system.

It's the wait time that I'd be more worried about. As I said, my wild
guess is that the wait times are creeping up.

I.e. it's the worst possible case for a spinlock->rwlock conversion! The only way I can believe this makes a difference is through cycle-level races and small random micro-differences that cause heavier bouncing in the spinlock workload but happen to avoid it in the read-lock case - not through any fundamental advantage of rwlocks.

I'd say the 12-way results show that there is a fundamental advantage
(although that's pending on whether lockstat is wrecking the results).
I'd even go out on a limb ;) and say that it will only become more
pronounced at higher CPU counts.

Correct me if I'm wrong, but... a read-lock requires at most a single
cacheline transfer per lock acquire and a single one per release, no matter
the concurrency on the lock (so long as the access is read-only).
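
To make that concrete, here is a toy user-space read-lock fast path (gcc
atomic builtins, not the kernel's rwlock_t - just a sketch of the shape of
the operation): the acquire is a single atomic RMW on the lock word, so each
reader pulls the line exclusive once to acquire and once to release,
independent of how many other readers there are.

/* count >= 0: number of readers; assume a writer path drives it negative. */
struct toy_rwlock {
        int count;
};

static int toy_read_trylock(struct toy_rwlock *l)
{
        int c = l->count;

        /* One cacheline transfer: the CAS pulls the line exclusive once. */
        return c >= 0 && __sync_bool_compare_and_swap(&l->count, c, c + 1);
}

static void toy_read_unlock(struct toy_rwlock *l)
{
        /* Second (and final) transfer for this reader. */
        __sync_fetch_and_sub(&l->count, 1);
}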

A spinlock is going to take more. If the hardware perfectly round-robins
the cacheline, it will take lockers+1 transfers per lock+unlock. Of
course, hardware might be pretty unfair for efficiency, but there will
still be some probability of the cacheline bouncing to other lockers
while it is locked. And that probability will increase proportionally to
the number of lockers.
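
Back-of-the-envelope, with those (purely illustrative) assumptions:

#include <stdio.h>

int main(void)
{
        int lockers;

        for (lockers = 2; lockers <= 16; lockers *= 2)
                printf("%2d lockers: ~2 transfers per read_lock+unlock, "
                       "~%d per spin_lock+unlock (round-robin model)\n",
                       lockers, lockers + 1);
        return 0;
}

So the rwlock read side stays flat while the spinlock cost grows with the
number of lockers, which is the kind of divergence the 12-way numbers hint at.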

--
SUSE Labs, Novell Inc.