Re: [PATCH RFC 1/2] qrwlock: A queue read/write lock implementation

From: Thomas Gleixner
Date: Mon Jul 15 2013 - 18:32:08 EST


On Fri, 12 Jul 2013, Waiman Long wrote:

> This patch introduces a read/write lock implementation that puts
> waiting readers and writers into a queue instead of having them
> actively contend for the lock like the regular read/write lock. This
> improves performance in highly contended situations by reducing the
> cache line bouncing effect.
>
> In addition, the queue read/write lock is more deterministic. Even
> though there is still a small chance of lock stealing if a reader or
> writer arrives at just the right moment, lock granting is otherwise
> done in FIFO order. The only downside is the size increase in the
> lock structure: 4 bytes for 32-bit systems and 12 bytes for
> 64-bit systems.
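
For reference, those size numbers are consistent with a layout along
the following lines (a compilable sketch with made-up names, not the
actual patch): a 4-byte count word next to a wait-queue pointer costs
4 extra bytes on 32-bit and, with alignment padding, 12 extra bytes
on 64-bit.

  #include <stdio.h>

  typedef struct { int cnts; } rwlock_sketch_t;   /* existing 4-byte rwlock */

  struct qrwnode;                                 /* wait queue node */
  typedef struct {
          int cnts;                               /* reader count + writer byte */
          struct qrwnode *waitq;                  /* tail of the wait queue */
  } qrwlock_sketch_t;

  int main(void)
  {
          /* 32-bit: 8 vs 4 -> +4 bytes; 64-bit: 16 vs 4 -> +12 bytes */
          printf("%zu -> %zu bytes\n",
                 sizeof(rwlock_sketch_t), sizeof(qrwlock_sketch_t));
          return 0;
  }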
>
> This patch allows architecture-specific read/write lock
> implementations to be replaced by this generic queue read/write
> lock. Two new config parameters are introduced:
>
> 1. QUEUE_RWLOCK
> A selectable option that enables building the generic lock and
> replacing the architecture-specific read/write lock with it.
> 2. ARCH_QUEUE_RWLOCK
> Has to be defined in arch/$(arch)/Kconfig to make QUEUE_RWLOCK
> selectable.
>
> In terms of single-thread performance (no contention), a 256K
> lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
> table shows the average time for a single lock/unlock sequence:
>
> Lock Type Time (ns)
> --------- ---------
> Ticket spinlock 15.7
> Read lock 17.0
> Write lock 17.2
> Queue read lock 31.1
> Queue write lock 13.6
>
> While the queue read lock takes almost double the time of a read lock
> or spinlock, the queue write lock is the fastest of them all. The
> execution time can probably be reduced a bit by allowing inlining of
> the lock fast paths, as is done for the other locks.

So you have the writer side faster than the ticket lock, but what
about the reader side? And what about the contended case? What about
the case of high-frequency readers with low-frequency writers?
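
To be concrete about what's missing: a read-mostly measurement along
these lines (a userspace sketch, with pthread rwlocks standing in for
the kernel primitives and all names hypothetical) would show how the
doubled reader-side latency plays out when writers are rare:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
  static atomic_long reads;
  static atomic_int stop;
  static long shared_data;

  static void *reader(void *arg)
  {
          (void)arg;
          while (!atomic_load(&stop)) {
                  pthread_rwlock_rdlock(&lock);
                  (void)shared_data;              /* short read-side section */
                  pthread_rwlock_unlock(&lock);
                  atomic_fetch_add(&reads, 1);
          }
          return NULL;
  }

  static void *writer(void *arg)
  {
          (void)arg;
          while (!atomic_load(&stop)) {
                  pthread_rwlock_wrlock(&lock);
                  shared_data++;
                  pthread_rwlock_unlock(&lock);
                  usleep(1000);                   /* low-frequency writer */
          }
          return NULL;
  }

  int main(void)
  {
          enum { NREADERS = 8, SECS = 5 };
          pthread_t r[NREADERS], w;

          for (int i = 0; i < NREADERS; i++)
                  pthread_create(&r[i], NULL, reader, NULL);
          pthread_create(&w, NULL, writer, NULL);
          sleep(SECS);
          atomic_store(&stop, 1);
          for (int i = 0; i < NREADERS; i++)
                  pthread_join(r[i], NULL);
          pthread_join(w, NULL);
          printf("read acquisitions/sec: %ld\n", atomic_load(&reads) / SECS);
          return 0;
  }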

> To see how the queue write lock can be used as a replacement for a
> ticket spinlock (just as an rwsem can be used as a replacement for a
> mutex), the mb_cache_spinlock in fs/mbcache.c, which is a bottleneck
> in the disk workload (ext4 FS) of the AIM7 benchmark, was converted
> to both a queue write lock and a regular write lock. When running on
> an 8-socket 80-core DL980 system, the performance improvement shown
> in the table below was obtained.
>
> +-----------------+------------+------------+-----------+---------+
> | Configuration | Mean JPM | Mean JPM | Mean JPM | qrwlock |
> | |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
> +-----------------+-----------------------------------------------+
> | | User Range 10 - 100 |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off | 441374 | 532774 | 637205 | +20.7% |
> | 8 nodes, HT on | 449373 | 584387 | 641965 | +30.1% |
> +-----------------+-----------------------------------------------+
> | | User Range 200 - 1000 |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off | 226870 | 354490 | 371593 | +56.3% |
> | 8 nodes, HT on | 205997 | 314891 | 306378 | +52.9% |
> +-----------------+-----------------------------------------------+
> | | User Range 1100 - 2000 |
> +-----------------+-----------------------------------------------+
> | 8 nodes, HT off | 208234 | 321420 | 343694 | +54.4% |
> | 8 nodes, HT on | 199612 | 297853 | 252569 | +49.2% |
> +-----------------+------------+------------+-----------+---------+
>
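The conversion itself is mechanical; schematically it is something
like this (illustrative only, not the actual fs/mbcache.c change, and
the rwlock name is made up):

  -static DEFINE_SPINLOCK(mb_cache_spinlock);
  +static DEFINE_RWLOCK(mb_cache_rwlock);

  -	spin_lock(&mb_cache_spinlock);
  +	write_lock(&mb_cache_rwlock);	/* writer side used as a spinlock */

  -	spin_unlock(&mb_cache_spinlock);
  +	write_unlock(&mb_cache_rwlock);
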
> Apparently, the regular read/write lock performs even better than
> the queue read/write lock in some cases. This is probably due to the

The regular rwlock performs better in most cases. This is the full
list comparing both against the ticket lock.

qrwlock rwlock
+20.7 +44.4
+30.1 +42.9

+56.3 +63.3
+52.9 +48.8

+54.4 +65.1
+49.2 +26.5

So you try to sell that qrwlock as a replacement for ticket spinlocks,
while at the same time you omit the fact that we have an even better
implementation (except for the last test case) already in the
kernel. What's the point of this exercise?

> fact that mb_cache_spinlock is in a separate cache line from the data
> being manipulated.

This is not about probably. What has the fact that the spinlock is in
a different cache line to do with the replacement by a different
implementation? Exactly nothing.

All three candidates (ticket spinlocks, qrwlocks and rwlocks) are in a
different cache line than the data they protect.

So what makes the performance difference between the three
implementations? Definitely not the cache line association as that is
the same for all implementations. And just for the record, we have
tools to measure that.

We really need proper explanations and not some guess-based drivel to
assess that.

Further down you write in the code:

> + * Compared with the regular read/write lock, the queue read/write lock
> + * has the following advantages:
> + * 1. It is more deterministic. Even though there is a slight chance of

Why is it more deterministic than the existing implementation?

> + * stealing the lock if one comes at the right moment, the granting of
> + * the lock is mostly in FIFO order.

> + * 2. It is faster in high contention situations.

Again, why is it faster?

> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
> + * and 12 bytes larger in 64-bit systems.
> + *
> + * There are two queues for writers. The writer field of the lock is a
> + * one-slot waiting queue. The writers that follow will have to wait
> + * in the combined reader/writer queue (waitq).
> + *
> + * Compared with the x86 ticket spinlock, the queue read/write lock is
> + * faster in high contention situations. The writer lock is also faster
> + * in single-thread operation. Therefore, the queue read/write lock can
> + * be considered as a replacement for those spinlocks that are highly
> + * contended, as long as an increase in lock size is not an issue.
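
The one-slot writer queue described above would be something like this
(compilable sketch, made-up names, with reader parking and the waitq
machinery omitted):

  #include <stdatomic.h>
  #include <stdbool.h>

  struct qwaiter;                          /* node in the combined wait queue */

  struct qrwlock_sketch {
          atomic_int readers;              /* active reader count */
          atomic_bool writer;              /* the one-slot writer queue */
          _Atomic(struct qwaiter *) waitq; /* combined reader/writer queue */
  };

  /*
   * The first contending writer parks in the one-slot 'writer' field and
   * spins until the readers drain; any later writer finds the slot taken
   * and has to queue on 'waitq' behind earlier readers and writers.
   */
  static bool writer_claim_slot(struct qrwlock_sketch *l)
  {
          bool expected = false;

          if (!atomic_compare_exchange_strong(&l->writer, &expected, true))
                  return false;            /* slot occupied: fall back to waitq */
          while (atomic_load(&l->readers))
                  ;                        /* wait for active readers to drain */
          return true;                     /* writer lock owned */
  }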

So is it faster in the general case or only in the high-contention and
single-thread cases?

And you still fail to explain WHY it is faster. Can you please explain
properly WHY it is faster and WHY we can't apply the technique you
implemented for qrwlocks to writer-only locks (aka spinlocks) with a
smaller lock size?
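
For the record, the queueing technique is not rwlock-specific at all.
An MCS-style queue lock applies it to a writer-only lock, and the lock
proper shrinks to a single tail pointer. A userspace sketch with C11
atomics (not kernel code):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct mcs_node {
          _Atomic(struct mcs_node *) next;
          atomic_bool locked;
  };

  struct mcs_lock {
          _Atomic(struct mcs_node *) tail; /* the whole lock: one pointer */
  };

  static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
  {
          struct mcs_node *prev;

          atomic_store(&node->next, NULL);
          atomic_store(&node->locked, true);

          /* A single atomic exchange joins the queue tail; this is the
           * only operation that touches the shared lock word. */
          prev = atomic_exchange(&lock->tail, node);
          if (!prev)
                  return;                  /* queue was empty: acquired */

          atomic_store(&prev->next, node);
          while (atomic_load(&node->locked))
                  ;                        /* spin on our own node only */
  }

  static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
  {
          struct mcs_node *next = atomic_load(&node->next);

          if (!next) {
                  /* No visible successor: try to swing the tail back to
                   * empty.  If that fails, a successor is concurrently
                   * enqueueing itself; wait until it is linked in. */
                  struct mcs_node *expected = node;

                  if (atomic_compare_exchange_strong(&lock->tail,
                                                     &expected, NULL))
                          return;
                  while (!(next = atomic_load(&node->next)))
                          ;
          }
          atomic_store(&next->locked, false); /* hand off to the successor */
  }

Each waiter spins on its own node instead of the shared lock word,
which is the same property the qrwlock relies on, and the lock itself
stays pointer-sized.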

> Looking at the fserver and new_fserver workloads (ext4 FS) where the
> journal->j_state_lock (a read/write lock) is a bottleneck especially
> when HT is on, we see a slightly different story. The j_state_lock is
> an embedded read/write lock which is in the same cache line as some
> of the data being manipulated. The replacement by a queue read/write
> lock gives the following improvement.

Again, the fact that the lock is embedded and in the same cache line
does not explain anything here. Especially not why this is only
relevant when HT is enabled. Please provide sound arguments backed
by solid data collected with the proper tools.

> +--------------------+-------------+--------------+---------------+
> | Workload |mean % change|mean % change |mean % change |
> | |10-100 users |200-1000 users|1100-2000 users|
> +--------------------+-------------+--------------+---------------+
> |fserver (HT off) | +0.3% | -0.1% | +1.9% |
> |fserver (HT on) | -0.1% | +32.2% | +34.7% |
> |new_fserver (HT on) | +0.8% | +0.9% | +0.9% |
> |new_fserver (HT off)| -1.2% | +29.8% | +40.5% |
> +--------------------+-------------+--------------+---------------+

So fserver gets a 34% improvement for HT ON, while new_fserver gets a
40% improvement for HT OFF. Any useful explanations except that the HT
on/off annotations are swapped?

Without explanations this is completely useless, really.

And again we want to see the behaviour on machines which do not have a
gazillion sockets before we enable this unconditionally, as you try
to do in the next patch. Just get it: 8-socket machines are nice toys,
but 99% of the users do not have one.

Aside from that, you are replacing all rwlocks unconditionally by this
newfangled thing, but did you actually run tests which look at other
rwlock usage sites than the particular one you care about?

I really doubt that. Why? Because it depends on the workload. And
fserver triggers high-frequency writers on the j_state_lock. But
there are enough usage sites of rwlocks where the writer frequency is
extremely low. So how does the almost doubled reader-side time affect
those users of rwlocks?

And reading further down your patch, I can confirm my suspicion.

> +void queue_write_lock(struct qrwlock *lock)
> +{
....
> + /*
> + * At the head of the wait queue now, wait until no writer is pending
> + * and then atomically set it again.
> + */

You are optimizing for the high-frequency writer case. And that's not
the primary use case for rwlocks. That's the special use case of the
jbd2 j_state_lock which CANNOT be generalized for all other
rwlock usage sites.

So how can you justify:

> +config QUEUE_RWLOCK
> + bool "Generic queue read/write lock"
> + depends on ARCH_QUEUE_RWLOCK
> + help
> + Use a NUMA optimized queue read/write lock implementation. This
> + improves performance under lock contention on systems with more
> + than two sockets.
> +

And in Patch 2/2:

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
bool
depends on STA2X11

+config ARCH_QUEUE_RWLOCK
+ def_bool y
+

It's not justified at all, unless you provide enough proof that it
does not hurt workloads which have totally different rwlock usage
patterns. And proof means numbers, not some handwaving about things
probably being related to cache lines or whatever.

Thanks,

tglx