On Fri, 12 Jul 2013, Waiman Long wrote:
In term of single-thread performance (no contention), a 256KSo you have the writer side faster than the ticket lock, but what
lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
table shows the average time for a single lock/unlock sequence:
Lock Type Time (ns)
--------- ---------
Ticket spinlock 15.7
Read lock 17.0
Write lock 17.2
Queue read lock 31.1
Queue write lock 13.6
While the queue read lock is almost double the time of a read lock
or spinlock, the queue write lock is the fastest of them all. The
execution time can probably be reduced a bit by allowing inlining of
the lock fast paths like the other locks.
about the reader side? And what about the contended case? What about
high frequent readers with low frequent writers cases?
To see how the queue write lock can be used as a replacement for ticketThe regular rwlock performs better in most cases. This is the full
spinlock (just like rwsem can be used as replacement of mutex), the
mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in the disk
workload (ext4 FS) of the AIM7 benchmark, was converted to both a queue
write lock and a regular write lock. When running on a 8-socket 80-core
DL980 system, the performance improvement was shown in the table below.
+-----------------+------------+------------+-----------+---------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | qrwlock |
| |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
+-----------------+-----------------------------------------------+
| | User Range 10 - 100 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 441374 | 532774 | 637205 | +20.7% |
| 8 nodes, HT on | 449373 | 584387 | 641965 | +30.1% |
+-----------------+-----------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 226870 | 354490 | 371593 | +56.3% |
| 8 nodes, HT on | 205997 | 314891 | 306378 | +52.9% |
+-----------------+-----------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------------------+
| 8 nodes, HT off | 208234 | 321420 | 343694 | +54.4% |
| 8 nodes, HT on | 199612 | 297853 | 252569 | +49.2% |
+-----------------+------------+------------+-----------+---------+
Apparently, the regular read/write lock performs even better than
the queue read/write lock in some cases. This is probably due to the
list comparing both against the ticket lock.
qrlock rwlock
+20.7 +44.4
+30.1 +42.9
+56.3 +63.3
+52.9 +48.8
+54.4 +65.1
+49.2 +26.5
So you try to sell that qrwlock as a replacement for ticket spinlocks,
while at the same time you omit the fact that we have an even better
implementation (except for the last test case) already in the
kernel. What's the point of this exercise?
fact that mb_cache_spinlock is in a separate cache line from the dataThis is not about probably. What has the fact that the spinlock is in
being manipulated.
a different cache line to do with the replacement by a different
implementation? Exactly nothing.
All three candidates ticket spinlocks, qrwlocks and rwlocks are in a
different cache line than the data which is protected by it.
So what makes the performance difference between the three
implementations? Definitely not the cache line association as that is
the same for all implementations. And just for the record, we have
tools to measure that.
We really need proper explanations and not some guess based drivel to
assess that.
Further down you write in the code:
+ * Compared with regular read/write lock, the queue read/write lock hasWhy is it more deterministic than the existing implementation?
+ * has the following advantages:
+ * 1. It is more deterministic. Even though there is a slight chance of
+ * stealing the lock if come at the right moment, the granting of theAgain, why is it faster?
+ * lock is mostly in FIFO order.
+ * 2. It is faster in high contention situation.
+ * The only downside is that the lock is 4 bytes larger in 32-bit systemsSo is it faster in the general case or only for the high contention or
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot waiting queue. The writers that follows will have to wait
+ * in the combined reader/writer queue (waitq).
+ *
+ * Compared with x86 ticket spinlock, the queue read/write lock is faster
+ * in high contention situation. The writer lock is also faster in single
+ * thread operations. Therefore, queue read/write lock can be considered as
+ * a replacement for those spinlocks that are highly contended as long as
+ * an increase in lock size is not an issue.
single thread operation cases?
And you still miss to explain WHY it is faster. Can you please explain
proper WHY it is faster and WHY we can't apply that technique you
implemented for qrwlocks to writer only locks (aka spinlocks) with a
smaller lock size?
Looking at the fserver and new_fserver workloads (ext4 FS) where theAgain, the fact that the lock is embedded and in the same cache line
journal->j_state_lock (a read/write lock) is a bottleneck especially
when HT is on, we see a slight different story. The j_state_lock is
an embedded read/write lock which is in the same cacheline as some
of the data being manipulated. The replacement by a queue read/write
lock gives the following improvement.
is not explaining anything here. Especially not why this is only
relevant when HT is enabled. Please provide profound arguments backed
by solid data collected with the proper tools.
+--------------------+-------------+--------------+---------------+So fserver gets a 34% improvement for HT ON, while new_fserver gets a
| Workload |mean % change|mean % change |mean % change |
| |10-100 users |200-1000 users|1100-2000 users|
+--------------------+-------------+--------------+---------------+
|fserver (HT off) | +0.3% | -0.1% | +1.9% |
|fserver (HT on) | -0.1% | +32.2% | +34.7% |
|new_fserver (HT on) | +0.8% | +0.9% | +0.9% |
|new_fserver (HT off)| -1.2% | +29.8% | +40.5% |
+--------------------+-------------+--------------+---------------+
40% improvement for HT OFF. Any useful explanations except that the HT
on/off annotations are swapped?
Without explanations this is completely useless, really.
And again we want to see the behaviour on machines which do not have a
gazillion of sockets before we enable this unconditionally, as you try
to do in the next patch. Just get it: 8 socket machines are nice toys,
but 99% of the users do not have one.
Aside of that, you are replacing all RW locks unconditionally by this
new fangled thing, but did you actually run tests which look at other
rwlock usage sites than the particular one you care about?
I really doubt that. Why? Because it depends on the workload. And
fserver triggers high frequency writers on the j_state_lock. But
there are enough usage sites of rwlocks where the writer frequency is
extremly low. So how does the almost doubled reader time affect those
users of rwlocks?
And reading further down your patch, I can confirm my suspicion.
+void queue_write_lock(struct qrwlock *lock)....
+{
+ /*You are optimizing for the high frequency writer case. And that's not
+ * At the head of the wait queue now, wait until no writer is pending
+ * and then atomically set it again.
+ */
the primary use case for rwlocks. That's the special use case for the
jbd2 journal_state_lock which CANNOT be generalized for all other
rwlock usage sites.