* Waiman Long <waiman.long@xxxxxx> wrote:

> > > + * stealing the lock if come at the right moment, the granting of the
> > > + * lock is mostly in FIFO order.
> > > + * 2. It is faster in high contention situation.
> >
> > Again, why is it faster?
>
> The current rwlock implementation suffers from a thundering herd
> problem. When many readers are waiting for the lock held by a writer,
> they will all jump in more or less at the same time when the writer
> releases the lock. That is not the case with qrwlock. It has been shown
> in many cases that avoiding this thundering herd problem can lead to
> better performance.

Btw., it's possible to further optimize this "writer releases the lock to
multiple readers spinning" thundering herd scenario in the classic
read_lock() case, without changing the queueing model.
Right now the read_lock() fast path is a single atomic instruction. When
a writer releases the lock it becomes available to all readers, and each
reader will execute a LOCK DEC instruction which will succeed.
This is the relevant code in arch/x86/lib/rwlock.S [edited for
readability]:
__read_lock_failed():

0:      LOCK_PREFIX
        READ_LOCK_SIZE(inc) (%__lock_ptr)       # undo the fast path's failed LOCK DEC

1:      rep; nop                                # cpu_relax() / PAUSE
        READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
        js      1b                              # spin read-only while a writer holds the lock

        LOCK_PREFIX
        READ_LOCK_SIZE(dec) (%__lock_ptr)       # retry taking the read lock
        js      0b                              # lost the race to a writer, start over

        ret
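
For reference, here's a rough C equivalent of that slow path (an
illustrative sketch using C11 atomics, not the kernel's actual code; the
lock word, bias value and function name below are made up for this
example). Note the two LOCKed read-modify-writes that every waiting
reader performs on the shared cacheline:

    #include <stdatomic.h>

    #define RW_LOCK_BIAS_SKETCH 0x01000000  /* unlocked value; a writer subtracts the bias */

    static atomic_int lock_count = RW_LOCK_BIAS_SKETCH;

    static void read_lock_failed_sketch(void)
    {
            for (;;) {
                    atomic_fetch_add(&lock_count, 1);       /* 0: LOCK INC - undo the failed decrement */

                    while (atomic_load(&lock_count) < 1)    /* 1: spin read-only while a writer holds it */
                            ;                               /*    (rep; nop) */

                    if (atomic_fetch_sub(&lock_count, 1) > 0) /* LOCK DEC - retry taking the read lock */
                            return;                         /* result stayed non-negative: got the lock */

                    /* js 0b: a writer won the race - undo the decrement and spin again */
            }
    }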
This is where we could optimize: instead of signalling to each reader that
it's fine to decrease the count, and letting dozens of readers do that on
the same cache-line, which ping-pongs around the NUMA cross-connect
touching every other CPU as they execute the LOCK DEC instruction, we
could let the _writer_ modify the count on unlock, in essence setting it
to the exact value that readers expect.
Since read_lock() can never abort, this should be relatively
straightforward: the INC above could be left out, and the writer side
needs to detect that there are no other writers waiting, so it can set
the count to the 'reader locked' value, which the readers will detect
without modifying the cache line:
__read_lock_failed():

0:      rep; nop                                # cpu_relax() / PAUSE
        READ_LOCK_SIZE(cmp) $1, (%__lock_ptr)
        js      0b                              # spin read-only until the writer grants the lock

        ret
(Unless I'm missing something, that is.)
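
To spell out why the INC can be left out: each spinning reader's original
fast-path LOCK DEC is then left in place, so the (negative) count already
encodes the number of waiting readers, and the writer's single atomic ADD
of the bias on unlock produces exactly the 'reader locked' value the
readers are waiting for. A rough C sketch of both halves, if I'm reading
the idea right (same made-up names as above, and it glosses over
writer-vs-writer handover):

    /* Proposed reader slow path: no LOCKed instructions at all.  Our
     * fast-path decrement stays in the count and registers us as a waiting
     * reader; we merely spin read-only until the releasing writer adds the
     * bias back. */
    static void read_lock_failed_sketch_v2(void)
    {
            while (atomic_load(&lock_count) < 1)    /* 0: rep; nop + read-only cmp */
                    ;                               /* count >= 1: lock granted */
    }

    /* Proposed writer unlock: one atomic ADD both releases the lock and
     * admits every spinning reader at once, since the result is exactly
     * "bias - number of waiting readers". */
    static void write_unlock_sketch(void)
    {
            /* (Glossed over: if another writer is queued, hand over to it
             * instead of granting the lock to the readers.) */
            atomic_fetch_add(&lock_count, RW_LOCK_BIAS_SKETCH);
    }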
That way the current write_unlock() followed by a 'thundering herd' of
__read_lock_failed() atomic accesses is transformed into an efficient
read-only broadcast of information with only a single update to the
cacheline: the writer-updated cacheline propagates in parallel to every
CPU and is cached there.
On typical hardware this will be broadcast to all CPUs as part of regular
MESI invalidation bus traffic.
Reader unlock will still have to modify the cacheline, so rwlocks will
still have a fundamental scalability limit even in the read-only use case.
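
(In sketch form, with the same made-up names: each read_unlock() is still
one atomic RMW on the shared lock word, so with N readers the line is
still dirtied N times per read-side critical section.)

    static void read_unlock_sketch(void)
    {
            atomic_fetch_add(&lock_count, 1);       /* LOCK INC on the shared cacheline */
    }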