Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

From: Nicholas Piggin
Date: Tue Mar 16 2021 - 01:00:44 EST


Excerpts from Davidlohr Bueso's message of March 9, 2021 11:59 am:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
>
> However, data shows that while this pattern works well with simple
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.

Thanks for tracking this down and the comprehensive results, great
work.

It's only a relatively recent patch, so I think the revert is a
good idea (i.e., don't keep it around for possibly other code to
hit problems with).

One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
* smp_cond_load_relaxed was found to have performance problems if
* implemented with spin_begin()/spin_end().
*/

I wonder if it should have a Fixes: tag to the original commit as
well.

Otherwise,

Acked-by: Nicholas Piggin <npiggin@xxxxxxxxx>

Thanks,
Nick

>
> Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
> 2 sockets and 8 threads per core.
>
> 1. locktorture.
>
> This is data for the lowest and most artificial/pathological level,
> with increasing thread counts pounding on the lock. Metrics are total
> ops/minute. Despite some small hits in the 4-8 range, scenarios are
> either neutral or favorable to this patch.
>
> +=========+==========+==========+=======+
> | # tasks | vanilla | dirty | %diff |
> +=========+==========+==========+=======+
> | 2 | 46718565 | 48751350 | 4.35 |
> +---------+----------+----------+-------+
> | 4 | 51740198 | 50369082 | -2.65 |
> +---------+----------+----------+-------+
> | 8 | 63756510 | 62568821 | -1.86 |
> +---------+----------+----------+-------+
> | 16 | 67824531 | 70966546 | 4.63 |
> +---------+----------+----------+-------+
> | 32 | 53843519 | 61155508 | 13.58 |
> +---------+----------+----------+-------+
> | 64 | 53005778 | 53104412 | 0.18 |
> +---------+----------+----------+-------+
> | 128 | 53331980 | 54606910 | 2.39 |
> +=========+==========+==========+=======+
>
> 2. sockperf (tcp throughput)
>
> Here a client will do one-way throughput tests to a localhost server, with
> increasing message sizes, dealing with the sk_lock. This patch shows to put
> the performance of the qspinlock back to par with that of the simple lock:
>
> simple-spinlock vanilla dirty
> Hmean 14 73.50 ( 0.00%) 54.44 * -25.93%* 73.45 * -0.07%*
> Hmean 100 654.47 ( 0.00%) 385.61 * -41.08%* 771.43 * 17.87%*
> Hmean 300 2719.39 ( 0.00%) 2181.67 * -19.77%* 2666.50 * -1.94%*
> Hmean 500 4400.59 ( 0.00%) 3390.77 * -22.95%* 4322.14 * -1.78%*
> Hmean 850 6726.21 ( 0.00%) 5264.03 * -21.74%* 6863.12 * 2.04%*
>
> 3. dbench (tmpfs)
>
> Configured to run with up to ncpusx8 clients, it shows both latency and
> throughput metrics. For the latency, with the exception of the 64 case,
> there is really nothing to go by:
> vanilla dirty
> Amean latency-1 1.67 ( 0.00%) 1.67 * 0.09%*
> Amean latency-2 2.15 ( 0.00%) 2.08 * 3.36%*
> Amean latency-4 2.50 ( 0.00%) 2.56 * -2.27%*
> Amean latency-8 2.49 ( 0.00%) 2.48 * 0.31%*
> Amean latency-16 2.69 ( 0.00%) 2.72 * -1.37%*
> Amean latency-32 2.96 ( 0.00%) 3.04 * -2.60%*
> Amean latency-64 7.78 ( 0.00%) 8.17 * -5.07%*
> Amean latency-512 186.91 ( 0.00%) 186.41 * 0.27%*
>
> For the dbench4 Throughput (misleading but traditional) there's a small
> but rather constant improvement:
>
> vanilla dirty
> Hmean 1 849.13 ( 0.00%) 851.51 * 0.28%*
> Hmean 2 1664.03 ( 0.00%) 1663.94 * -0.01%*
> Hmean 4 3073.70 ( 0.00%) 3104.29 * 1.00%*
> Hmean 8 5624.02 ( 0.00%) 5694.16 * 1.25%*
> Hmean 16 9169.49 ( 0.00%) 9324.43 * 1.69%*
> Hmean 32 11969.37 ( 0.00%) 12127.09 * 1.32%*
> Hmean 64 15021.12 ( 0.00%) 15243.14 * 1.48%*
> Hmean 512 14891.27 ( 0.00%) 15162.11 * 1.82%*
>
> Measuring the dbench4 Per-VFS Operation latency, shows some very minor
> differences within the noise level, around the 0-1% ranges.
>
> Signed-off-by: Davidlohr Bueso <dbueso@xxxxxxx>
> ---
> arch/powerpc/include/asm/barrier.h | 16 ----------------
> 1 file changed, 16 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do { \
> ___p1; \
> })
>
> -#ifdef CONFIG_PPC64
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({ \
> - typeof(ptr) __PTR = (ptr); \
> - __unqual_scalar_typeof(*ptr) VAL; \
> - VAL = READ_ONCE(*__PTR); \
> - if (unlikely(!(cond_expr))) { \
> - spin_begin(); \
> - do { \
> - VAL = READ_ONCE(*__PTR); \
> - } while (!(cond_expr)); \
> - spin_end(); \
> - } \
> - (typeof(*ptr))VAL; \
> -})
> -#endif
> -
> #ifdef CONFIG_PPC_BOOK3S_64
> #define NOSPEC_BARRIER_SLOT nop
> #elif defined(CONFIG_PPC_FSL_BOOK3E)
> --
> 2.26.2
>
>