RE: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

From: David Laight
Date: Fri Dec 22 2023 - 07:40:44 EST


From: Zeng Heng
> Sent: 22 December 2023 12:11
>
> Using the UnixBench test suite, we can clearly see with the perf tool
> that osq_lock() causes extremely high overhead in the File Copy tests:
>
>   Overhead  Shared Object  Symbol
>     94.25%  [kernel]       [k] osq_lock
>      0.74%  [kernel]       [k] rwsem_spin_on_owner
>      0.32%  [kernel]       [k] filemap_get_read_batch
>
> In response to this, we analysed the problem and found an improvement:
>
> In the prologue of osq_lock(), the `cpu` member of the percpu struct
> optimistic_spin_node is set to the local cpu id; after that, the
> value never changes. Based on that, we can regard the `cpu` member
> as effectively constant.
>
...
> @@ -9,7 +11,13 @@
>  struct optimistic_spin_node {
>  	struct optimistic_spin_node *next, *prev;
>  	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +
> +	CACHELINE_PADDING(_pad1_);
> +	/*
> +	 * Stores an encoded CPU # + 1 value.
> +	 * Only read by other cpus, so split into different cache lines.
> +	 */
> +	int cpu;
>  };
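
(For context, the store being described lives in the osq_lock() prologue.
A from-memory sketch of the current kernel/locking/osq_lock.c, not a
verbatim quote, with the slow path elided:)

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
		int curr = encode_cpu(smp_processor_id());
		int old;

		node->locked = 0;
		node->next = NULL;
		node->cpu = curr;	/* same value on every call on a given cpu */

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;

		/* ... queueing slow path elided ... */
	}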

Isn't this structure embedded in every mutex and rwsem (etc)?
So that is significant bloat, especially on systems with
large cache lines.
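
Back-of-the-envelope, the padding grows each instance from 24 bytes to
two full cache lines. A quick userspace approximation (the aligned
attribute is my stand-in for CACHELINE_PADDING(); if memory serves the
real macro in include/linux/cache.h is a zero-size, cacheline-aligned
struct; 64-byte lines are assumed here, and parts with 128-byte lines
pay double):

	#include <stdio.h>

	#define SMP_CACHE_BYTES 64	/* assumed L1 cache line size */

	struct node_before {			/* 24 bytes on LP64 */
		struct node_before *next, *prev;
		int locked;
		int cpu;
	};

	struct node_after {
		struct node_after *next, *prev;
		int locked;
		/* stand-in for CACHELINE_PADDING(_pad1_): push 'cpu'
		 * onto its own cache line, as the patch does */
		int cpu __attribute__((aligned(SMP_CACHE_BYTES)));
	};

	int main(void)
	{
		printf("before: %zu bytes\n", sizeof(struct node_before)); /* 24 */
		printf("after:  %zu bytes\n", sizeof(struct node_after));  /* 128 */
		return 0;
	}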

Did you try just moving the initialisation of the per-cpu 'node'
below the first fast-path (uncontended) test in osq_lock()?
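
I.e. something along these lines (an untested sketch against my memory
of the current source; the cross-cpu visibility constraints are noted
inline and would need a careful audit):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
		struct optimistic_spin_node *prev;
		int curr = encode_cpu(smp_processor_id());
		int old;

		/*
		 * ->next must stay initialised before the xchg(): once
		 * our cpu number is visible in lock->tail a successor
		 * can link itself in by storing to our ->next, so a
		 * late NULL store here would race with that.
		 */
		node->next = NULL;

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;	/* fast path: ->locked/->cpu never written */

		/*
		 * Contended: finish initialising the node.  ->locked is
		 * only written by the predecessor once it can reach us
		 * via prev->next (stored below), so deferring it looks
		 * safe.  ->cpu is read as prev->cpu in osq_wait_next();
		 * it holds the same value on every call, so a stale read
		 * is only possible on a cpu's first ever osq_lock(),
		 * and that window is the bit to audit.
		 */
		node->locked = 0;
		node->cpu = curr;

		prev = decode_cpu(old);
		node->prev = prev;
		WRITE_ONCE(prev->next, node);

		/* ... spin / unqueue path unchanged ... */
	}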

OTOH if you really have multiple cpus spinning on the same rwsem,
perhaps the test and/or the filemap code are really at fault!

David
