Re: [RFC PATCH 05/25] kvx: Add atomic/locking headers

From: Jules Maselbas
Date: Fri Jan 06 2023 - 09:12:10 EST


Hi Mark,

On Wed, Jan 04, 2023 at 09:53:24AM +0000, Mark Rutland wrote:
> On Tue, Jan 03, 2023 at 05:43:39PM +0100, Yann Sionneau wrote:
> > Add common headers (atomic, bitops, barrier and locking) for basic
> > kvx support.
> >
> > CC: Will Deacon <will@xxxxxxxxxx>
> > CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > CC: Boqun Feng <boqun.feng@xxxxxxxxx>
> > CC: Mark Rutland <mark.rutland@xxxxxxx>
> > CC: linux-kernel@xxxxxxxxxxxxxxx
> > Co-developed-by: Clement Leger <clement.leger@xxxxxxxxxxx>
> > Signed-off-by: Clement Leger <clement.leger@xxxxxxxxxxx>
> > Co-developed-by: Jules Maselbas <jmaselbas@xxxxxxxxx>
> > Signed-off-by: Jules Maselbas <jmaselbas@xxxxxxxxx>
> > Co-developed-by: Julian Vetter <jvetter@xxxxxxxxx>
> > Signed-off-by: Julian Vetter <jvetter@xxxxxxxxx>
> > Co-developed-by: Julien Villette <jvillette@xxxxxxxxx>
> > Signed-off-by: Julien Villette <jvillette@xxxxxxxxx>
> > Co-developed-by: Yann Sionneau <ysionneau@xxxxxxxxx>
> > Signed-off-by: Yann Sionneau <ysionneau@xxxxxxxxx>
> > ---
> > arch/kvx/include/asm/atomic.h | 104 +++++++++++++++++
> > arch/kvx/include/asm/barrier.h | 15 +++
> > arch/kvx/include/asm/bitops.h | 207 +++++++++++++++++++++++++++++++++
> > arch/kvx/include/asm/bitrev.h | 32 +++++
> > arch/kvx/include/asm/cmpxchg.h | 185 +++++++++++++++++++++++++++++
> > 5 files changed, 543 insertions(+)
> > create mode 100644 arch/kvx/include/asm/atomic.h
> > create mode 100644 arch/kvx/include/asm/barrier.h
> > create mode 100644 arch/kvx/include/asm/bitops.h
> > create mode 100644 arch/kvx/include/asm/bitrev.h
> > create mode 100644 arch/kvx/include/asm/cmpxchg.h
> >
> > diff --git a/arch/kvx/include/asm/atomic.h b/arch/kvx/include/asm/atomic.h
> > new file mode 100644
> > index 000000000000..eb8acbcbc70d
> > --- /dev/null
> > +++ b/arch/kvx/include/asm/atomic.h
> > @@ -0,0 +1,104 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2017-2023 Kalray Inc.
> > + * Author(s): Clement Leger
> > + */
> > +
> > +#ifndef _ASM_KVX_ATOMIC_H
> > +#define _ASM_KVX_ATOMIC_H
> > +
> > +#include <linux/types.h>
> > +
> > +#include <asm/cmpxchg.h>
> > +
> > +#define ATOMIC64_INIT(i) { (i) }
> > +
> > +#define arch_atomic64_cmpxchg(v, old, new) (arch_cmpxchg(&((v)->counter), old, new))
> > +#define arch_atomic64_xchg(v, new) (arch_xchg(&((v)->counter), new))
> > +
> > +static inline long arch_atomic64_read(const atomic64_t *v)
> > +{
> > + return v->counter;
> > +}
>
> This is a plain read, and is *not* atomic.
>
> The compiler can replay a plain read an arbitrary number of times, and is
> permitted to split it into smaller accesses.
>
> At minimum this needs to be
>
> READ_ONCE(v->counter)
>
> ... which will prevent replay. Whether or not that's actually atomic will
> depend on the instructions the compiler generates, and how those instructions
> are defines in your architecture.
Good point, we are going to use {READ,WRITE}_ONCE macros

> Do you have a single instruction that can read a 64-bit memory location, and is
> it guaranteed to result in a single access that cannot be split?

We do have a single instruction that can read a 64-bit memory location
(supported sizes are 8, 16, 32, 64, 128, 256).
All accesses are guaranteed to not be split, unless they are misaligned.
Furthermore, misaligned write accesses crossing a 32-byte boundary may
complete in a non-atomic way.

>
> > +static inline void arch_atomic64_set(atomic64_t *v, long i)
> > +{
> > + v->counter = i;
> > +}
>
> Same comments as for arch_atomic64_read(); at minimum this needs to be:
>
> WRITE_ONCE(v->counter, i)
>
> ... but that may or may not actually be atomic on your architecture.
>
> > +#define ATOMIC64_RETURN_OP(op, c_op) \
> > +static inline long arch_atomic64_##op##_return(long i, atomic64_t *v) \
> > +{ \
> > + long new, old, ret; \
> > + \
> > + do { \
> > + old = v->counter; \
> > + new = old c_op i; \
> > + ret = arch_cmpxchg(&v->counter, old, new); \
> > + } while (ret != old); \
> > + \
> > + return new; \
> > +}
> > +
> > +#define ATOMIC64_OP(op, c_op) \
> > +static inline void arch_atomic64_##op(long i, atomic64_t *v) \
> > +{ \
> > + long new, old, ret; \
> > + \
> > + do { \
> > + old = v->counter; \
> > + new = old c_op i; \
> > + ret = arch_cmpxchg(&v->counter, old, new); \
> > + } while (ret != old); \
> > +}
> > +
> > +#define ATOMIC64_FETCH_OP(op, c_op) \
> > +static inline long arch_atomic64_fetch_##op(long i, atomic64_t *v) \
> > +{ \
> > + long new, old, ret; \
> > + \
> > + do { \
> > + old = v->counter; \
> > + new = old c_op i; \
> > + ret = arch_cmpxchg(&v->counter, old, new); \
> > + } while (ret != old); \
> > + \
> > + return old; \
> > +}
>
> These look ok, but it'd be nicer if we could teach the generic atomic64 code to
> do this, like the generic atomic code does.
>
> We could rename the existing asm-generic/atomic64 code to atomic64-spinlock,
> and add a separate atomic64-cmpxchg (and likewise for the 32-bit code) to make
> that clearer and consistent.
I am not sure what this implies and how big this change might be,
but I'll take a look at this.

> > +
> > +#define ATOMIC64_OPS(op, c_op) \
> > + ATOMIC64_OP(op, c_op) \
> > + ATOMIC64_RETURN_OP(op, c_op) \
> > + ATOMIC64_FETCH_OP(op, c_op)
> > +
> > +ATOMIC64_OPS(and, &)
> > +ATOMIC64_OPS(or, |)
> > +ATOMIC64_OPS(xor, ^)
> > +ATOMIC64_OPS(add, +)
> > +ATOMIC64_OPS(sub, -)
> > +
> > +#undef ATOMIC64_OPS
> > +#undef ATOMIC64_FETCH_OP
> > +#undef ATOMIC64_OP
> > +
> > +static inline int arch_atomic_add_return(int i, atomic_t *v)
> > +{
> > + int new, old, ret;
> > +
> > + do {
> > + old = v->counter;
> > + new = old + i;
> > + ret = arch_cmpxchg(&v->counter, old, new);
> > + } while (ret != old);
> > +
> > + return new;
> > +}
> > +
> > +static inline int arch_atomic_sub_return(int i, atomic_t *v)
> > +{
> > + return arch_atomic_add_return(-i, v);
> > +}
>
> Likewise for these two.
>
> > +
> > +#include <asm-generic/atomic.h>
> > +
> > +#endif /* _ASM_KVX_ATOMIC_H */
> > diff --git a/arch/kvx/include/asm/barrier.h b/arch/kvx/include/asm/barrier.h
> > new file mode 100644
> > index 000000000000..371f1c70746d
> > --- /dev/null
> > +++ b/arch/kvx/include/asm/barrier.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2017-2023 Kalray Inc.
> > + * Author(s): Clement Leger
> > + */
> > +
> > +#ifndef _ASM_KVX_BARRIER_H
> > +#define _ASM_KVX_BARRIER_H
>
> > +/* Bitmask modifiers */
> > +#define __NOP(x) (x)
> > +#define __NOT(x) (~(x))
> > +
> > +
> > +#define __test_and_op_bit(nr, addr, op, mod) \
> > +({ \
> > + unsigned long __mask = BIT_MASK(nr); \
> > + unsigned long __new, __old, __ret; \
> > + do { \
> > + __old = *(&addr[BIT_WORD(nr)]); \
> > + __new = __old op mod(__mask); \
> > + __ret = cmpxchg(addr, __old, __new); \
> > + } while (__ret != __old); \
> > + (__old & __mask); \
> > +})
>
> Please use <asm-generic/bitops/atomic.h> which should give you the common
> bit operations "for free" atop your regular atomics.
Yes

>
> [...]
>
> > diff --git a/arch/kvx/include/asm/cmpxchg.h b/arch/kvx/include/asm/cmpxchg.h
> > new file mode 100644
> > index 000000000000..b1d128b060a2
> > --- /dev/null
> > +++ b/arch/kvx/include/asm/cmpxchg.h
> > @@ -0,0 +1,185 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2017-2023 Kalray Inc.
> > + * Author(s): Clement Leger
> > + * Yann Sionneau
> > + */
> > +
> > +#ifndef _ASM_KVX_CMPXCHG_H
> > +#define _ASM_KVX_CMPXCHG_H
> > +
> > +#include <linux/bits.h>
> > +#include <linux/types.h>
> > +#include <linux/build_bug.h>
> > +
> > +/*
> > + * On kvx, we have a boolean compare and swap which means that the operation
> > + * returns only the success of operation.
> > + * If operation succeed, this is simple, we just need to return the provided
> > + * old value. However, if it fails, we need to load the value to return it for
> > + * the caller. If the loaded value is different from the "old" provided by the
> > + * caller, we can return it since it will means it failed.
> > + * However, if for some reason the value we read is equal to the old value
> > + * provided by the caller, we can't simply return it or the caller will think it
> > + * succeeded. So if the value we read is the same as the "old" provided by
> > + * the caller, we try again until either we succeed or we fail with a different
> > + * value than the provided one.
> > + */
> > +#define __cmpxchg(ptr, old, new, op_suffix, load_suffix) \
> > +({ \
> > + register unsigned long __rn asm("r62"); \
> > + register unsigned long __ro asm("r63"); \
>
> Why do you need to specify the exact registers?
r62 and r63 are hardcoded in the inline assembly, they are caller saved.
I have a C implementation that uses builtins however this is not merged
in our tree yet (but I want to).

> e.g. does some instruction use these implicitly, or do you need two adjacent
> register for encoding reasons?

The atomic compare and swap (acswap) instruction needs a register "pair"
which can only exists with "adjacent" registers: $r0r1, $r2r3 ect.

> > + __asm__ __volatile__ ( \
> > + /* Fence to guarantee previous store to be committed */ \
> > + "fence\n" \
>
> This implies you can implement the relaxed form of cmpxchg().
>
> What ordering do you get by default, and do you have any other barriers (e.g.
> for acquire/release semantics), or just "fence" ?
We have two barrier types:
- fence: ensure that all uncached memory operations are committed to
memory, typically used to ensure a write to memory is visible to
other cores.
- barrier: flush the core instruction pipeline and memory system

Thanks,
-- Jules