Re: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics

From: Paul E. McKenney
Date: Tue Oct 31 2023 - 10:43:23 EST


On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote:
> Hello,
>
> On Tue, 31 Oct 2023, Peter Zijlstra wrote:
>
> (I can't say anything about the WRITE_ONCE/rcu code, just about the below
> codegen part)
>
> > Welcome is not the right word. What bugs me most is that this was never
> > raised when this code was written :/
> >
> > Mostly my problem is that GCC generates such utter shite when you
> > mention volatile. See, the below patch changes the perfectly fine and
> > non-broken:
> >
> > 0148 1d8: 49 83 06 01 addq $0x1,(%r14)
>
> What is non-broken here that is ...
>
> > into:
> >
> > 0148 1d8: 49 8b 06 mov (%r14),%rax
> > 014b 1db: 48 83 c0 01 add $0x1,%rax
> > 014f 1df: 49 89 06 mov %rax,(%r14)
>
> ... broken here? (Sure code size and additional register use, but I don't
> think you mean this with broken).
>
> > For absolutely no reason :-(
>
> The reason is simple (and should be obvious): to adhere to the abstract
> machine regarding volatile. When x is volatile then x++ consists of a
> read and a write, in this order. The easiest way to ensure this is to
> actually generate a read and a write instruction. Anything else is an
> optimization, and for each such optimization you need to actively find an
> argument why this optimization is correct to start with (and then if it's
> an optimization at all). In this case the argument needs to somehow
> involve arguing that an rmw instruction on x86 is in fact completely
> equivalent to the separate instructions, from read cycle to write cycle
> over all pipeline stages, on all implementations of x86. I.e. that a rmw
> instruction is spec'ed to be equivalent.
>
> You most probably can make that argument in this specific case, I'll give
> you that. But why bother to start with, in a piece of software that is
> already fairly complex (the compiler)? It's much easier to just not do
> much anything with volatile accesses at all and be guaranteed correct.
> Even more so as the software author, when using volatile, most likely is
> much more interested in correct code (even from a abstract machine
> perspective) than micro optimizations.

I suspect that Peter cares because the code in question is on a fast
path within the scheduler.

> > At least clang doesn't do this, it stays:
> >
> > 0403 413: 49 ff 45 00 incq 0x0(%r13)
> >
> > irrespective of the volatile.
>
> And, are you 100% sure that this is correct? Even for x86 CPU
> pipeline implementations that you aren't intimately knowing about? ;-)

Me, I am not even sure that the load-increment-store is always faithfully
executed by the hardware. ;-)

> But all that seems to be a side-track anyway, what's your real worry with
> the code sequence generated by GCC?

Again, my guess is that Peter cares deeply about the speed of the fast
path on which this increment occurs.

Thanx, Paul