[PATCH tip/locking/core v10 0/7] locking/qspinlock: Enhance qspinlock & pvqspinlock performance

From: Waiman Long
Date: Mon Nov 09 2015 - 19:09:58 EST

v9->v10:
- Broke patch 2 into two separate patches (suggested by PeterZ).
- Changed the slowpath statistical counter code back to using debugfs
while keeping the per-cpu counter setup.
- Made some minor tweaks and added comments to the lock stealing
and adaptive spinning patches.

v8->v9:
- Added a new patch 2 that prefetches the cacheline of the next
MCS node in order to reduce the MCS unlock latency.
- Changed the slowpath statistical counters implementation in patch
4 from atomic_t to per-cpu variables to reduce performance overhead,
and used sysfs instead of debugfs to return the consolidated counts
and data.

v7->v8:
- Annotated the use of each _acquire/_release variant in qspinlock.c.
- Used the available pending bit in the lock stealing patch to disable
lock stealing when the queue head vCPU is actively spinning on the
lock to avoid lock starvation.
- Restructured the lock stealing patch to reduce code duplication.
- Verified that the waitcnt processing will be compiled away if
QUEUED_LOCK_STAT isn't enabled.

v6->v7:
- Removed arch/x86/include/asm/qspinlock.h from patch 1.
- Removed the unconditional PV kick patch as it has been merged
into tip.
- Changed the pvstat_inc() API to add a new condition parameter.
- Added comments and rearranged code in patch 4 to clarify where
lock stealing happens.
- In patch 5, removed the check for pv_wait count when deciding when
to wait early.
- Updated copyrights and email address.

v5->v6:
- Added a new patch 1 to relax the cmpxchg and xchg operations in
the native code path to reduce performance overhead on non-x86
architectures.
- Updated the unconditional PV kick patch as suggested by PeterZ.
- Added a new patch to allow one lock stealing attempt at slowpath
entry point to reduce the performance penalty due to lock waiter
preemption.
- Removed the pending bit and kick-ahead patches as they didn't show
any noticeable performance improvement on top of the lock stealing
patch.
- Simplified the adaptive spinning patch, as the lock stealing patch
allows more aggressive pv_wait() without much performance penalty
in non-overcommitted VMs.

v4->v5:
- Rebased the patch to the latest tip tree.
- Corrected the comments and commit log for patch 1.
- Removed the v4 patch 5 as PV kick deferment is no longer needed with
the new tip tree.
- Simplified the adaptive spinning patch (patch 6) & improved its
performance a bit further.
- Re-ran the benchmark test with the new patch.

v3->v4:
- Patch 1: add comment about possible race condition in PV unlock.
- Patch 2: simplified the pv_pending_lock() function as suggested by
- Move PV unlock optimization patch forward to patch 4 & rerun
performance test.

v2->v3:
- Moved the deferred kicking enablement patch forward & moved back
the kick-ahead patch to make the effect of kick-ahead more visible.
- Reworked patch 6 to make it more readable.
- Reverted to using state as a tri-state variable instead of
adding an additional bi-state variable.
- Added performance data for different values of PV_KICK_AHEAD_MAX.
- Added a new patch to optimize PV unlock code path performance.

v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
balance its differing effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-committed VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory order restriction of atomic operations by
using less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This will reduce performance overhead when ported to other
non-x86 architectures.
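The idea behind patch 1 can be sketched with C11 atomics rather than the kernel's primitives (the type and function names below are hypothetical, not the qspinlock code): an acquire-ordered compare-exchange on lock acquisition and a release-ordered store on unlock. On x86 these compile to the same instructions as the fully ordered variants, but on weakly ordered architectures they avoid unnecessary full barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch only; kernel cmpxchg_acquire()/smp_store_release()
 * express the same orderings. */
typedef struct { atomic_int val; } sketch_lock_t;	/* hypothetical type */

static bool sketch_trylock(sketch_lock_t *lock)
{
	int expected = 0;

	/* Acquire ordering on success: later loads/stores cannot be
	 * reordered before the lock acquisition; no full barrier needed. */
	return atomic_compare_exchange_strong_explicit(&lock->val,
			&expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

static void sketch_unlock(sketch_lock_t *lock)
{
	/* Release ordering: earlier loads/stores cannot be reordered
	 * after the store that frees the lock. */
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

The failure ordering is relaxed because a failed trylock publishes nothing; only a successful acquisition needs to order the critical section.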

Patch 2 attempts to prefetch the cacheline of the next MCS node to
reduce latency in the MCS unlock operation.
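A minimal sketch of that idea, with a simplified, hypothetical MCS node layout (not the kernel structure): once a waiter knows it has a successor, prefetch the successor's node for write, so the cacheline is likely already local when the unlock path finally stores to it.

```c
#include <stddef.h>

struct mcs_sketch_node {		/* hypothetical, simplified node */
	struct mcs_sketch_node *next;
	int locked;
};

static void prefetch_successor(struct mcs_sketch_node *node)
{
	struct mcs_sketch_node *next = node->next;

	/* Warm the successor's cacheline before the hand-off store;
	 * semantically a no-op, so the NULL check just skips the hint. */
	if (next)
		__builtin_prefetch(next, 1);	/* 1 = prefetch for write */
}
```

`__builtin_prefetch` is the GCC/Clang builtin; the kernel's `prefetchw()` wraps the per-architecture equivalent.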

Patch 3 removes a redundant read of the next pointer.

Patch 4 optimizes the PV unlock code path performance for x86-64.

Patch 5 allows the collection of various slowpath statistics counter
data that are useful to see what is happening in the system. Per-cpu
counters are used to minimize performance overhead.
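The per-cpu counter scheme can be sketched with a plain array (names and sizes below are illustrative; the kernel uses real per-cpu variables): each CPU increments its own padded slot with no atomics and no cacheline bouncing, and only the slow read side walks all slots to produce a total.

```c
/* Hypothetical sketch of per-cpu statistics counters. */
#define NR_CPUS_SKETCH	4
#define CACHELINE	64

static struct {
	long count;
	char pad[CACHELINE - sizeof(long)];	/* avoid false sharing */
} pv_stat[NR_CPUS_SKETCH];

static void stat_inc(int cpu)
{
	pv_stat[cpu].count++;		/* local, contention-free update */
}

static long stat_read(void)
{
	long sum = 0;

	/* Read side (e.g. a debugfs file) sums all the per-CPU slots. */
	for (int i = 0; i < NR_CPUS_SKETCH; i++)
		sum += pv_stat[i].count;
	return sum;
}
```

The fast path pays only a non-atomic increment to private data; all aggregation cost is pushed to the rarely taken read path.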

Patch 6 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.
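A hypothetical, much-simplified sketch of the limited-stealing entry check: a thread entering the slowpath makes one trylock attempt before queuing, but a pending bit set by the queue head while it actively spins disables stealing so the head cannot be starved. Bit names and layout here are illustrative, not the qspinlock word format.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SK_LOCKED	0x01	/* lock is held */
#define SK_PENDING	0x02	/* queue head is spinning: no stealing */

static bool steal_lock_once(atomic_int *lock)
{
	int val = atomic_load_explicit(lock, memory_order_relaxed);

	if (val & (SK_LOCKED | SK_PENDING))
		return false;		/* held, or stealing disabled */

	/* One attempt only; on failure the caller queues up normally. */
	return atomic_compare_exchange_strong_explicit(lock, &val,
			val | SK_LOCKED,
			memory_order_acquire, memory_order_relaxed);
}
```

Allowing exactly one attempt bounds the unfairness: a preempted waiter no longer stalls late arrivals, yet a spinning queue head can shut stealing off.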

Patch 7 enables adaptive spinning in the queue nodes. This patch
leads to further performance improvement in over-committed guests,
though not as large as that of the previous patch.
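The adaptive-spinning shape can be sketched as follows (hypothetical names; `SPIN_THRESHOLD_SKETCH` is illustrative): a queued vCPU spins on its node's hand-off flag only up to a threshold, then gives up so the caller can halt via `pv_wait()` instead of burning cycles behind a possibly preempted predecessor.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_THRESHOLD_SKETCH	1024

static bool adaptive_spin(atomic_int *node_locked)
{
	for (int loop = 0; loop < SPIN_THRESHOLD_SKETCH; loop++) {
		if (atomic_load_explicit(node_locked, memory_order_acquire))
			return true;	/* hand-off arrived while spinning */
		/* cpu_relax() would go here in kernel code */
	}
	return false;			/* caller should pv_wait() now */
}
```

Bounded spinning keeps the fast hand-off cheap in lightly loaded guests while falling back to halting quickly when the system is over-committed.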

Waiman Long (7):
locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
locking/qspinlock: prefetch next node cacheline
locking/qspinlock: Avoid redundant read of next pointer
locking/pvqspinlock, x86: Optimize PV unlock code path
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock: Allow limited lock stealing
locking/pvqspinlock: Queue node adaptive spinning

arch/x86/Kconfig | 8 +
arch/x86/include/asm/qspinlock_paravirt.h | 59 ++++++
include/asm-generic/qspinlock.h | 9 +-
kernel/locking/qspinlock.c | 90 +++++++--
kernel/locking/qspinlock_paravirt.h | 252 +++++++++++++++++++++----
kernel/locking/qspinlock_stat.h | 293 +++++++++++++++++++++++++++++
6 files changed, 648 insertions(+), 63 deletions(-)
create mode 100644 kernel/locking/qspinlock_stat.h
