Re: Finding open-coded workarounds for 1/2-byte cmpxchg()?

From: Paul E. McKenney
Date: Tue Apr 09 2024 - 15:27:25 EST

Next message: David Hildenbrand: "[PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()"
Previous message: David Hildenbrand: "[PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Apr 05, 2024 at 04:18:29PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 06, 2024 at 01:00:35AM +0200, Julia Lawall wrote:
> >
> >
> > On Thu, 4 Apr 2024, Paul E. McKenney wrote:
> >
> > > Hello, Julia!
> > >
> > > I hope that things are going well for you and yours.
> > >
> > > TL;DR: Would you or one of your students be interested in looking for
> > > some interesting code patterns involving cmpxchg? If such patterns exist,
> > > we would either need to provide fixes or to drop support for old systems.
> > >
> > > If this would be of interest, please read on!
> > >
> > > Arnd (CCed) and I are looking for open-coded emulations for one-byte
> > > and two-byte cmpxchg(). Such emulations might be attempting to work
> > > around the fact that not all architectures support those sizes, being
> > > as they are only required to support four-byte cmpxchg() and, if they
> > > are 64-bit architectures, eight-byte cmpxchg().
> > >
> > > There is a one-byte emulation in RCU (kernel/rcu/tasks.h), which looks
> > > like this:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > u8 rcu_trc_cmpxchg_need_qs(struct task_struct *t, u8 old, u8 new)
> > > {
> > > union rcu_special ret;
> > > union rcu_special trs_old = READ_ONCE(t->trc_reader_special);
> > > union rcu_special trs_new = trs_old;
> > >
> > > if (trs_old.b.need_qs != old)
> > > return trs_old.b.need_qs;
> > > trs_new.b.need_qs = new;
> > > ret.s = cmpxchg(&t->trc_reader_special.s, trs_old.s, trs_new.s);
> > > return ret.b.need_qs;
> > > }
> > >
> > > ------------------------------------------------------------------------
> > >
> > > An additional issue is posed by these, also in kernel/rcu/tasks.h:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > if (trs.b.need_qs == (TRC_NEED_QS_CHECKED | TRC_NEED_QS)) {
> > >
> > > return smp_load_acquire(&t->trc_reader_special.b.need_qs);
> > >
> > > smp_store_release(&t->trc_reader_special.b.need_qs, v);
> > >
> > > ------------------------------------------------------------------------
> > >
> > > The additional issue is that these statements assume that each CPU
> > > architecture has single-byte load and store instructions, which some of
> > > the older Alpha systems do not. Fortunately for me, Arnd was already
> > > thinking in terms of removing support for these systems.
> > >
> > > But there are additional systems that do not support 16-bit loads and
> > > stores. So if there is a 16-bit counterpart to rcu_trc_cmpxchg_need_qs()
> > > on a quantity that is also subject to 16-bit loads or stores, either
> > > that function needs adjustment or a few more ancient systems need to
> > > lose their Linux-kernel support.
> > >
> > > Again, is looking for this sort of thing something that you or one of
> > > your students would be interested in?
> >
> > Hello,
> >
> > I tried, but without much success. The following looks a little bit
> > promising, eg the use of the variable name "want", but it's not clear that
> > the rest of the context fits the pattern.
>
> Thank you for digging into this!!!
>
> > diff -u -p /home/julia/linux/net/sunrpc/xprtsock.c
> > /tmp/nothing/net/sunrpc/xprtsock.c
> > --- /home/julia/linux/net/sunrpc/xprtsock.c
> > +++ /tmp/nothing/net/sunrpc/xprtsock.c
> > @@ -690,12 +690,9 @@ xs_read_stream(struct sock_xprt *transpo
> > if (ret <= 0)
> > goto out_err;
> > transport->recv.offset = ret;
> > - if (transport->recv.offset != want)
> > - return transport->recv.offset;
>
> Agreed, though you are quite right that ->recv.copied and ->recv.offset
> are different lengths. But yes, as you sugggest below, there must be
> a cmpxchg() of some type (cmpxchg(), cmpxchg_acquire(), ...) in the mix
> somewhere. Also, the cmpxchg() must be applied to a pointer to either
> a 32-bit or a 64-bit quantity, but the change must be 16 bits (or 8 bits).
>
> > The semantic patch in question was:
> >
> > @r@
> > expression olde;
> > idexpression old;
> > @@
> >
> > if (olde != old) { ... return olde; }
> >
> > @@
> > expression newe != r.olde;
> > idexpression nw;
> > expression r.olde;
> > idexpression r.old;
> > @@
> >
> > *if (olde != old) { ... return olde; }
> > ...
> > *newe = nw;
> > ...
> > *return newe;
> >
> > The semantic patch doesn't include the cmpxchg. I wasn't sure if that
> > would always be present, or in what form.
>
> It would be, but I am having trouble characterizing exactly what the
> pattern would look like beyond "emulating a 16-bit cmpxchg() using either
> a 32-bit cmpxchg() or a 64-bit cmpxchg()". :-(
>
> Thank you again, and something to think more about.

I took the crude approach of looking at all of the cmpxchg*() invocations,
discarding those that were clearly not an issue. Here are the close calls
that I found:

o drivers/misc/genwqe/card_ddcb.c enqueue_ddcb() does work against
a union that has 32-bit, 16-bit, and eight-bit members, but as
far as I can see the ->icrc_16 member is not used. But this
might be an accident waiting to happen. Or maybe this driver
is used only by architectures with a full set of cmpxchg sizes.

The 8-bit ->hsi and ->shi fields are used for debug output, which
should be harmless.

Ditto __genwqe_purge_ddcb() that same file.

o drivers/platform/surface/aggregator/controller.c ssh_seq_next()
does an 8-bit cmpxchg().

o drivers/platform/surface/aggregator/controller.c ssh_rqid_next()
does a 16-bit cmpxchg().

o kernel/locking/qspinlock_paravirt.h pv_wait_node() does
an 8-bit cmpxchg(). As does pv_kick_node() in that same
file. And __pv_queued_spin_unlock(). And, as Arnd noted,
trylock_clear_pending() and pv_hybrid_queued_unfair_trylock()
in that same file do 16-bit cmpxchg_acquire().

o net/rxrpc/io_thread.c rxrpc_input_packet_on_conn() is strange
in that it supplies 16-bit old and new fields to a cmpxchg()
of a 32-bit quantity, but it is quite possible that this would
be a 16-bit quantity if permitted in core code. This is in an
rxrpc_connection structure.

There are no doubt some false negatives omitted from this list, but
there are a few places that use or would like to use 8-bit and 16-bit
cmpxchg*().

Thanx, Paul

Next message: David Hildenbrand: "[PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()"
Previous message: David Hildenbrand: "[PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]