Re: [PATCH v2 0/5] Switch arm64 over to qrwlock

From: Jeremy Linton
Date: Mon Oct 09 2017 - 18:31:33 EST


Hi,

On 10/06/2017 08:34 AM, Will Deacon wrote:
Hi all,

This is version two of the patches I posted yesterday:

http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html

I'd normally leave it longer before posting again, but Peter had a good
suggestion to rework the layout of the lock word, so I wanted to post a
version that follows that approach.
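
(As an aside for anyone reading along in the archive: the gist of the reworked layout, as I understand it, is that the writer state lives in the low bits of a single atomic word and the reader count sits above it, so waiters can spin on the whole word. A rough sketch follows; the names and bit positions are illustrative, not necessarily the exact values used in the series.)

/* Illustrative sketch of one possible qrwlock word layout. */
typedef struct qrwlock {
        atomic_t        cnts;           /* writer state + reader count  */
        arch_spinlock_t wait_lock;      /* serialises slowpath waiters  */
} arch_rwlock_t;

#define _QW_LOCKED      0x0ff           /* a writer holds the lock      */
#define _QW_WAITING     0x100           /* a slowpath writer is waiting */
#define _QW_WMASK       0x1ff           /* writer byte + waiting bit    */
#define _QR_SHIFT       9               /* reader count starts here     */
#define _QR_BIAS        (1U << _QR_SHIFT)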

I've updated my branch if you're after the full patch stack:

git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock

As before, all comments (particularly related to testing and performance)
welcome!

I've been doing perf comparisons on a single-socket ThunderX between this series, the rwlock fairness patch I posted last week, and the baseline rwlock. For most cases with similar reader/writer mixes (uncontended readers/writers, a single writer with lots of readers, etc.) the absolute number of lock acquisitions is very similar (within a few percent one way or the other).

In this regard both patches are light years ahead of the current arm64 rwlock. The unfairness of the current rwlock allows significantly higher lock acquisition counts (say 4x at 30 readers : 1 writer), but at the expense of complete writer starvation (a ~43k:1 read:write acquisition ratio at 30R:1W per locktorture). This is untenable.

The qrwlock does an excellent job of keeping the ratio of reader to writer acquisitions proportional to the number of readers and writers until the total number of lockers exceeds the number of cores, at which point the acquisition ratios start to far exceed the reader/writer ratios (440:1 acquisitions @ 96R:1W).
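
That proportional behaviour matches how the queued slowpath works: readers that miss the fastpath queue on the internal wait_lock, behind any writer already waiting there, instead of piling straight onto the lock word. A simplified sketch of the read slowpath, assuming the usual kernel primitives and the illustrative constants above (not the exact code in the series):

/* Simplified sketch of the queued read-lock slowpath. */
void queued_read_lock_slowpath(struct qrwlock *lock)
{
        /*
         * Queue behind any earlier waiters (including a pending
         * writer), so a stream of readers can no longer starve it.
         */
        arch_spin_lock(&lock->wait_lock);

        /* Count ourselves as a reader... */
        atomic_add(_QR_BIAS, &lock->cnts);

        /* ...and wait until no writer holds the lock. */
        atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));

        arch_spin_unlock(&lock->wait_lock);
}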

In comparison, the other patch tends to favor the writers more: at a ratio of 48R:1W, the readers are only grabbing the lock at a ratio of 15:1. This flatter curve continues past the number of cores, with the readers having a 48:1 advantage at 96R:1W. That said, the total lock acquisition counts remain very similar (with maybe a slight advantage to the non-queued patch with 1 writer and 12-30 readers) despite the writer acquiring the lock at a higher frequency. OTOH, if the number of writers is closer to the number of readers (24R:24W), then the writers have about a 1.5x bias over the readers, independent of the number of readers/writers. This bias seems to be the common multiplier for a given reader/writer ratio with those patches, and it doesn't account for possible single-thread starvation.

Of course, I've been running other tests as well, and the system seems to be behaving as expected (likely better than the rwlock patches under stress). I will continue to test this on a couple of other platforms.

In the meantime:

Tested-by: Jeremy Linton <jeremy.linton@xxxxxxx>



Cheers,

Will

--->8

Will Deacon (5):
kernel/locking: Use struct qrwlock instead of struct __qrwlock
locking/atomic: Add atomic_cond_read_acquire
kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
arm64: locking: Move rwlock implementation over to qrwlocks
kernel/locking: Prevent slowpath writers getting held up by fastpath

arch/arm64/Kconfig                      |  17 ++++
arch/arm64/include/asm/Kbuild           |   1 +
arch/arm64/include/asm/spinlock.h       | 164 +-------------------------------
arch/arm64/include/asm/spinlock_types.h |   6 +-
include/asm-generic/atomic-long.h       |   3 +
include/asm-generic/qrwlock.h           |  20 +---
include/asm-generic/qrwlock_types.h     |  15 ++-
include/linux/atomic.h                  |   4 +
kernel/locking/qrwlock.c                |  83 +++-------------
9 files changed, 58 insertions(+), 255 deletions(-)
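
(For completeness, a rough sketch of how patches 2, 3 and 5 above fit together on the writer side: the slowpath writer advertises itself in the lock word and then uses atomic_cond_read_acquire() to wait for the readers to drain before claiming the lock. This is simplified, reusing the illustrative constants from earlier, and is not the exact code in the series.)

/* Simplified sketch of the queued write-lock slowpath. */
void queued_write_lock_slowpath(struct qrwlock *lock)
{
        /* Serialise with other slowpath waiters. */
        arch_spin_lock(&lock->wait_lock);

        /* If the lock is completely free, take it straight away. */
        if (!atomic_read(&lock->cnts) &&
            atomic_cmpxchg_acquire(&lock->cnts, 0, _QW_LOCKED) == 0)
                goto unlock;

        /* Advertise a waiting writer so fastpath lockers back off. */
        atomic_or(_QW_WAITING, &lock->cnts);

        /* Wait for the readers to drain, then turn WAITING into LOCKED. */
        do {
                atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
        } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
                                        _QW_LOCKED) != _QW_WAITING);
unlock:
        arch_spin_unlock(&lock->wait_lock);
}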