Re: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

From: Zeng Heng
Date: Sat Dec 23 2023 - 03:54:55 EST



On 2023/12/22 20:40, David Laight wrote:
From: Zeng Heng
Sent: 22 December 2023 12:11

Using the UnixBench test suite, the perf tool clearly shows that
osq_lock() causes extremely high overhead in the File Copy items:

Overhead  Shared Object  Symbol
  94.25%  [kernel]       [k] osq_lock
   0.74%  [kernel]       [k] rwsem_spin_on_owner
   0.32%  [kernel]       [k] filemap_get_read_batch

In response, we analysed the problem and found the following:

In the prologue of osq_lock(), the `cpu` member of the per-cpu struct
optimistic_spin_node is set to the local CPU id, and after that the
value never changes. Based on that, we can regard the `cpu` member as
effectively constant.
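
For reference, the prologue in question looks roughly like this
(simplified from kernel/locking/osq_lock.c; not a verbatim copy):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
		int curr = encode_cpu(smp_processor_id());
		int old;

		node->locked = 0;
		node->next = NULL;
		node->cpu = curr;	/* same encoded value on every call */

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;	/* uncontended fast path */
		...
	}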

...
@@ -9,7 +11,13 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+
+	CACHELINE_PADDING(_pad1_);
+	/*
+	 * Stores an encoded CPU # + 1 value.
+	 * Only read by other cpus, so split into different cache lines.
+	 */
+	int cpu;
 };
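
(For reference: CACHELINE_PADDING is, as far as I recall, a zero-sized
cacheline-aligned member defined in include/linux/cache.h:

	#define CACHELINE_PADDING(name)	struct { } name ____cacheline_aligned

On a 64-bit kernel with 64-byte cache lines, that pads the node from 24
bytes up to two full cache lines, i.e. 128 bytes; with 128-byte lines,
256 bytes.)
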
Isn't this structure embedded in every mutex and rwsem (etc)?
So that is significant bloat, especially on systems with
large cache lines.

Did you try just moving the initialisation of the per-cpu 'node'
below the first fast-path (uncontended) test in osq_lock()?
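
(Illustrative sketch of that suggestion, not a tested patch; it assumes
the fast path can use a cmpxchg instead of the current unconditional
xchg:)

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node;
		int curr = encode_cpu(smp_processor_id());
		int old;

		/* Uncontended fast path: don't touch the per-cpu node. */
		if (atomic_cmpxchg(&lock->tail, OSQ_UNLOCKED_VAL, curr)
		    == OSQ_UNLOCKED_VAL)
			return true;

		/* Contended: only now initialise the per-cpu node. */
		node = this_cpu_ptr(&osq_node);
		node->locked = 0;
		node->next = NULL;
		node->cpu = curr;

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;
		...
	}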

OTOH if you really have multiple cpus spinning on the same rwsem,
perhaps the test and/or filemap code is really at fault!

David

Hi,

The File Copy items of the UnixBench test suite use 1 read file and
1 write file for the file read/write/copy tests. In a multi-parallel
scenario, that leads to high file-lock contention.

It is just a performance test suite and has nothing to do with whether
the user program's design is correct or not.


B.R.,
Zeng Heng