Re: [PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node

From: Waiman Long
Date: Wed Dec 20 2023 - 11:58:37 EST



On 12/20/23 02:02, Zeng Heng wrote:
Using the UnixBench test suite, the perf tool clearly shows that
osq_lock() causes extremely high overhead in the File Copy items:

Overhead  Shared Object  Symbol
  94.25%  [kernel]       [k] osq_lock
   0.74%  [kernel]       [k] rwsem_spin_on_owner
   0.32%  [kernel]       [k] filemap_get_read_batch

We analyzed this and arrived at the following:

In the prologue of osq_lock(), the `cpu` member of the per-CPU struct
optimistic_spin_node is set to the local CPU id; after that, its value
never changes in practice. Based on that, we can treat the `cpu` member
as effectively constant.

Meanwhile, the other members of the per-CPU struct (next, prev and
locked) are frequently modified by osq_lock() and osq_unlock(), which
are called by rwsem, mutex and so on. Each of those writes invalidates
the cache line holding the `cpu` member on other CPUs.

Therefore, we can add padding to split these members into separate
cache lines, avoiding cache misses when the next CPU spins and checks
another node's `cpu` member via vcpu_is_preempted().
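
To illustrate, the resulting layout looks roughly like this (a
simplified sketch, not a further change; ____cacheline_aligned expands
via <linux/cache.h> to __attribute__((__aligned__(SMP_CACHE_BYTES)))):

	struct optimistic_spin_node {
		struct optimistic_spin_node *next, *prev; /* frequently written locally */
		int locked;                               /* frequently written locally */
		int cpu ____cacheline_aligned;            /* set once, read remotely    */
	};

With next/prev/locked on one cache line and cpu starting on another,
remote readers of cpu no longer take a miss each time the owner CPU
updates its queue links.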

Here are the UnixBench full-core test results based on v6.6:
Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads.
We ran "./Run -c 80 -i 3" over 20 times and took the average.

System Benchmarks Index Values         Without Patch  With Patch    Diff
Dhrystone 2 using register variables       185518.38   185329.56  -0.10%
Double-Precision Whetstone                  79330.46    79268.22  -0.08%
Execl Throughput                             9725.14    10390.18   6.84%
File Copy 1024 bufsize 2000 maxblocks        1658.42     2035.55  22.74%
File Copy 256 bufsize 500 maxblocks          1086.54     1316.96  21.21%
File Copy 4096 bufsize 8000 maxblocks        3610.42     4152.79  15.02%
Pipe Throughput                             69325.18    69913.85   0.85%
Pipe-based Context Switching                14026.32    14703.07   4.82%
Process Creation                             8329.94     8355.31   0.30%
Shell Scripts (1 concurrent)                38942.41    41518.39   6.61%
Shell Scripts (8 concurrent)                37762.35    40224.49   6.52%
System Call Overhead                         4064.44     4004.45  -1.48%
                                            ========    ========  ======
System Benchmarks Index Score               13634.17    14560.71   6.80%

Signed-off-by: Zeng Heng <zengheng4@xxxxxxxxxx>
---
include/linux/osq_lock.h | 6 +++++-
kernel/locking/osq_lock.c | 8 +++++++-
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..f33f47ec0c90 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -9,7 +9,11 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+	/*
+	 * Stores an encoded CPU # + 1 value.
+	 * Only read by other CPUs, so keep it in a separate cache line.
+	 */
+	int cpu ____cacheline_aligned;
 };
 
 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..17618d62343f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * After the cpu member has been initialized once, it no longer
+	 * changes. Skipping the redundant store avoids cache misses on
+	 * other CPUs that spin on and read the cpu member.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
The contention on prev->cpu is due to the vcpu_is_preempted() call in osq_lock(). Could you try the attached patch to see if that helps?
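
For reference, the loop in question is the spin in osq_lock() (quoted
below in the patch context): each evaluation of the condition re-reads
node->prev and then prev->cpu, so the spinner competes for the cache
line with any CPU that is modifying node->prev.

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(node_cpu(node->prev))))
		return true;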

Thanks,
Longman

From 4a99deac7bd2a5e05e08d986e5761e9a15775eda Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@xxxxxxxxxx>
Date: Wed, 20 Dec 2023 11:03:34 -0500
Subject: [PATCH] locking/osq_lock: Minimize spinning on prev->cpu

When CONFIG_PARAVIRT_SPINLOCKS is set, osq_lock() will spin on both
node->locked and node->prev->cpu. That can cause contention with another
CPU modifying node->prev. Reduce that contention by caching prev and
prev->cpu and updating the cached values if node->prev changes.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/locking/osq_lock.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..91293401e3e6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -87,12 +87,27 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 	return next;
 }
 
+#ifndef vcpu_is_preempted
+#define prev_vcpu_is_preempted(n, p, c)	false
+#else
+static inline bool prev_vcpu_is_preempted(struct optimistic_spin_node *node,
+					  struct optimistic_spin_node **pprev,
+					  int *ppvcpu)
+{
+	if (node->prev != *pprev) {
+		*pprev = node->prev;
+		*ppvcpu = node_cpu(*pprev);
+	}
+	return vcpu_is_preempted(*ppvcpu);
+}
+#endif
+
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
 	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
 	struct optimistic_spin_node *prev, *next;
 	int curr = encode_cpu(smp_processor_id());
-	int old;
+	int old, pvcpu;
 
 	node->locked = 0;
 	node->next = NULL;
@@ -110,6 +125,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	prev = decode_cpu(old);
 	node->prev = prev;
+	pvcpu = node_cpu(prev);
 
 	/*
 	 * osq_lock() unqueue
@@ -141,7 +157,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * polling, be careful.
 	 */
 	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
-				  vcpu_is_preempted(node_cpu(node->prev))))
+				  prev_vcpu_is_preempted(node, &prev, &pvcpu)))
 		return true;
 
 	/* unqueue */
--
2.39.3