Re: [PATCH] x86/smp: Use atomic_try_cmpxchg() to micro-optimize native_stop_other_cpus()

From: Uros Bizjak
Date: Wed Nov 22 2023 - 15:19:03 EST


On Fri, Nov 17, 2023 at 6:23 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 11/17/23 08:37, Uros Bizjak wrote:
> > On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >> Is there some other side benefit that I'm missing here? Applying this
> >> patch doesn't seem to have a great risk/reward ratio.
> >
> > Yes, in addition to better asm code, I think that the use of magic
> > constant (-1) is not descriptive at all. I tried to make this code
> > look like nmi_panic() from kernel/panic.c, which has similar
> > functionality, and describe that this constant belongs to old_cpu
> > (same as in nmi_panic() ).
>
> I guess it's a step in the direction of nmi_panic(). But honestly, it
> doesn't go far enough for me at least. The nice part about nmi_panic()
> is that it gives -1 nice symbolic name and uses that name in the
> definition of the atomic_t.
>
> > Also, from converting many cmpxchg to try_cmpxchg, it becomes evident
> > that in cases like this (usage in "if" clauses) the correct locking
> > primitive is try_cmpxchg. Additionally, in this particular case, it
> > is not the speed, but a little code save that can be achieved with
> > the same functionality.
>
> I think I understand what you're trying to say: using try_cmpxchg can
> result in better code generation in general than plain cmpxchg. Thus,
> it's more "correct" to use try_cmpxchg in any case where plain cmpxchg
> is in use. Right?

Yes, when we have the condition "cmpxchg(*ptr, old, new) == old", then
using try_cmpxchg not only generates better code (note also how the
compiler creates fast path through the code due to likely/unlikely
annotations), but is also *less* error prone. E.g., there were a
couple of instances where the result of cmpxchg was compared to the
wrong variable.

> I honestly don't think cmpxchg is more or less "correct" than
> try_cmpxchg. If you're going to be sending patches like these, I'd
> remove this kind of language from your changelogs and discussions
> because it holds zero weight.
>
> Here's what I'm afraid of: this patch is not tested enough. We apply it
> and then start getting reports of reboot breaking on some server.
> Someone spends two hours bisecting to this patch. We'll wonder after
> the fact: Was this patch worth it?

Let me just say that after some 50 conversions of code of various
complexity to try_cmpxchg, fixing inconsistencies and plain logic bugs
on the way, removing superfluous memory reads form the loops,
eyeballing generated asm code and persuading relevant maintainers
about the benefit of the conversion, I can confidently say that this
particular conversion won't make troubles.

Even without the proposed conversion, the call to smp_processor_id()
should be put after early exit where reboot_force is checked.

> I don't think what you have here is more descriptive than what was there
> before. It still has a -1 and still doesn't logically connect it to the
> 'stopping_cpu' definition. I have no idea how much this was tested. It
> doesn't _clearly_ move the needle enough on making the code better to
> apply it.

I should state that I test my patches by bootstrapping the image in
qemu, and from time to time put together a bunch of patches and build
a real Fedora kernel rpm and run it for some time as my main kernel.
This last part gives me much confidence when the patch is discussed
with the maintainer. In this particular case, I didn't put much
attention on reboot functionality, but the patched kernel definitely
reboots without problems.

Regarding "-1", I was thinking of introducing a CPU_INVALID #define
instead of -1, but at the end, this #define should be eventually
introduced as a target-independent patch that puts the new define into
the generic part of the kernel source, since it looks like other
targets could use this define as well.

> We shouldn't be blindly converting cmpxchg=>try_cmpxchg, and then trying
> to justify it after the fact. I _do_ agree that try_cmpxchg() leads to
> better looking C code *AND* generated code. So, for new x86 code, it
> seems like the best thing to do. But for old (working) code, I think it
> should mostly be left alone.

I'm not here to discuss policies, but the "don't fix it if it ain't
broke" policy is a slippery slope, as it can lead to stagnation in the
long term. Please read what happened here [1].

[1] https://lwn.net/Articles/488847/

> Maybe you could keep an eye on:
>
> > https://lore.kernel.org/lkml/?q=dfb%3Aatomic_cmpxchg+-dfb%3Aatomic_try_cmpxchg
>
> and remind folks what they should be using, rather than expending
> efforts on trying to move existing code over.

Yes, I'm doing my best to point out optimization opportunities
involving try_cmpxchg when I come across one. But this one slipped
below my radar...

Thanks and BR,
Uros.