Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case

From: Waiman Long
Date: Thu Nov 18 2021 - 13:12:19 EST


On 11/18/21 07:57, Peter Zijlstra wrote:
On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:

Using the above benchmark, the real execution times on an x86-64 system
before and after the patch were:
What kind of x86_64 ?

                 Before Patch    After Patch
  # of Threads       real            real       reduced by
  ------------      ------          ------      ----------
             1      65,373          65,206          ~0.0%
             4      15,467          15,378          ~0.5%
            40       6,214           5,528         ~11.0%
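
(The benchmark itself is not quoted in this mail. Purely as an illustration
of the kind of workload behind these numbers, a minimal sketch of a contended
read-trylock loop could look like the module below; the thread count, loop
count, and all names are hypothetical, not taken from the original patch.)

#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/rwsem.h>

#define NR_THREADS	40			/* e.g. the most contended row above */
#define NR_LOOPS	(10UL * 1000 * 1000)

static DECLARE_RWSEM(bench_rwsem);

/* Each worker hammers the read-trylock fast path on one shared rwsem. */
static int bench_thread(void *unused)
{
	unsigned long i;

	for (i = 0; i < NR_LOOPS; i++) {
		if (down_read_trylock(&bench_rwsem))
			up_read(&bench_rwsem);
	}
	return 0;
}

static int __init bench_init(void)
{
	int i;

	for (i = 0; i < NR_THREADS; i++)
		kthread_run(bench_thread, NULL, "rwsem-bench/%d", i);
	return 0;
}

static void __exit bench_exit(void)
{
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");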

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended the rwsem is, the bigger the improvement.

Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
---
kernel/locking/rwsem.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..ef2b2a3f508c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
- /*
- * Optimize for the case when the rwsem is not locked at all.
- */
- tmp = RWSEM_UNLOCKED_VALUE;
- do {
+ tmp = atomic_long_read(&sem->count);
+ while (!(tmp & RWSEM_READ_FAILED_MASK)) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_READER_BIAS)) {
+ tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
return 1;
}
- } while (!(tmp & RWSEM_READ_FAILED_MASK));
+ }
return 0;
}
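
(For readers following along: with the hunk applied, __down_read_trylock()
reads roughly as below; the "long tmp;" declaration is surrounding context
that is not visible in the hunk.)

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long tmp;

	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

	/*
	 * One shared read up front; only attempt the cmpxchg while none
	 * of the read-fail bits are set in the count.
	 */
	tmp = atomic_long_read(&sem->count);
	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_READER_BIAS)) {
			rwsem_set_reader_owned(sem);
			return 1;
		}
		/*
		 * A failed cmpxchg refreshed tmp, so the loop condition
		 * rechecks the newly observed value.
		 */
	}
	return 0;
}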
This is weird... so the only difference is that leading load, but given
contention you'd expect that load to stall, and also, since it's a
non-exclusive load, you'd expect the line to get stolen by a competing
CPU. Whereas the old code would start with a cmpxchg, which obviously
will also stall, but does an exclusive load.

And the thinking is that the exclusive load and the presence of the
cmpxchg loop would keep the line on that CPU for a little while and
progress is made.

Clearly this isn't working as expected. Also I suppose it would need
wider testing...

For a contended case, doing a shared read first before the exclusive cmpxchg can certainly help to reduce cacheline thrashing. I have no objection to making this change.

I believe most of the other trylock functions do a read first before doing an atomic operation. In essence, we assume the use of trylock means the callers are expecting a contended lock, whereas callers of the regular *lock() functions are expecting an uncontended lock. One instance of that read-then-cmpxchg pattern is sketched below.
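
(As an example, the generic queued_spin_trylock() in
include/asm-generic/qspinlock.h has the same read-then-cmpxchg shape;
reproduced roughly from memory here, so treat it as a sketch rather than
an exact quote.)

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);

	/* Cheap shared read: if the lock is held, fail without an exclusive access. */
	if (unlikely(val))
		return 0;

	/* The lock looked free: try to claim it with a single acquire cmpxchg. */
	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
}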

Acked-by: Waiman Long <longman@xxxxxxxxxx>

-Longman