Re: [PATCH v16 3/8] cpu/hotplug: Add dynamic parallel bringup states before CPUHP_BRINGUP_CPU

From: David Woodhouse
Date: Fri Mar 24 2023 - 05:32:14 EST


On Fri, 2023-03-24 at 02:16 +0100, Thomas Gleixner wrote:
> On Thu, Mar 23 2023 at 23:12, David Woodhouse wrote:
> > On Fri, 2023-03-24 at 00:05 +0100, Thomas Gleixner wrote:
> > > +                       if (ret && can_rollback_cpu(st))
> > > +                               WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
> > > +               }
> >
> > And I'll take doing this bit unconditionally (it's basically a no-op if
> > they already got rolled all the way back to CPUHP_OFFLINE, right?).
> >
> > But the additional complexity of having multiple steps is fairly
> > minimal, and I'm already planning to *use* another one even in x86, as
> > discussed.
>
> It's not about the "complexity". That's a general design question and
> I'm not agreeing with your approach of putting AP specifics into the BP
> state space.
>
> The BP only phase ends at the point where the AP is woken up via
> SIPI/INIT/whatever. Period.
>
> And no, we are not making this special just because it's the easiest way
> to get it done. I have _zero_ interest in this kind of hackery which
> just slaps stuff into the code where its conveniant without even
> thinking about proper separations
>
> We went a great length to separate things clearly and it takes more than
> "oh let's reserve a few special states" to keep this separation
> intact. That's a matter of correctness, maintainability and taste.
>
> That microcode thing on X86 has absolutely no business in the pre
> bringup DYN states. It's an AP problem and it can be solved completely
> without BP interaction.

As long as the BP doesn't bring any more siblings online during the
update, sure.

> And before you start drooling over further great parallelism, can you
> please take a step back and come up with a sensible design for the
> actual real world requirments?
>
> The point is that after an AP comes out of lala land and starts
> executing there are mandatory synchronization points which need to be
> handled by design. The number of synchronization points is architecture
> and platform specific and might be 0, but thats a detail.
>
> So the proper thing is to split CPUHP_BRINGUP_CPU, which is the bridging
> state between AP and BP today, into a set of synchronization points
> between BP and AP.

I feel we're talking past each other a little bit. Because I'd have
said that's *exactly* what this patch set is doing.

The existing x86 CPUHP_BRINGUP_CPU step in native_cpu_up() has a bunch
of back-and-forth between BP and AP, which you've enumerated below and
which I cleaned up and commented in the 'Split up native_cpu_up into
separate phases and document them' patch.

This patch set literally introduces the new PARALLEL_DYN states to be
"a set of synchronization points between BP and AP", and then makes the
x86 code utilise that for the *first* of its back-and-forth exchanges
between BP and AP.

The patch to do the second stage and truly make it a 'set' on x86 is
waiting in the wings, precisely because it *happens* not to catch fire
yet, but hasn't been properly audited yet.

But it seemed to me that you were saying you want us to limit
architectures to just *one* additional step (CPUHP_BP_PARALLEL_STARTUP)
instead of the 'set' that I'd added with the PARALLEL_DYN states? Did I
misunderstand?

> But that's non-trivial because if you look at bringup_cpu() then you'll
> notice that this state has an implicit protection against interrupt
> allocation/free and quite some architectures rely on this for their
> early bringup code and possibly their STARTING state callbacks.

I literally pointed that out in 2021 (in one of the messages I
reference below).

For x86 I don't believe that's going to be an issue yet, if at all. I
think it matters for the lapic_online() call which happens near the end
of start_secondary(); it's almost the last thing it does before going
into the generic AP startup thread. It's *also* preceded by a comment
that at least *suggests* it ought to be fine anyway, although we should
never entirely trust such comments.

/*
* Lock vector_lock, set CPU online and bring the vector
* allocator online. Online must be set with vector_lock held
* to prevent a concurrent irq setup/teardown from seeing a
* half valid vector space.
*/
lock_vector_lock();
/* Sync point with do_wait_cpu_online() */
set_cpu_online(smp_processor_id(), true);
lapic_online();

We're a long way from parallelizing *that* one and needing to check the
veracity of the comment though.

> That aside. Let's just look at x86 as of today from the BP side:
>
>     1) Wakeup AP
>     2) Wait until there is sign of life
>     3) Let AP proceed
>     5) Wait until AP is done with init
>     6) TSC synchronization
>     7) Wait until it is online
>
> and on the AP side:
>
>     1) Do low level init
>     2) Report life
>     3) Wait until BP allows to proceed
>     4) Do init
>     5) Report done
>     6) TSC synchronization
>     7) Report online
>
> So surely you could claim that up to #6 (TSC sync) nothing needs to
> synchronize.

I don't think we could claim that at all. There are a whole bunch of
implicit dependencies on the fact that this happens in series, and at
any given point in the sequence we might know that *either* the BP is
free to act, *or* precisely one of the APs is free. For example:

> But that's simply not true because topology information is implicitely
> serialized by CPU hotplug and your attempt to serialize that in patch
> 7/8 is not cutting it at all because AP #4 (STARTING) callbacks rely
> implicitely on the immutability of the topology information

Right, and that's just fine. That patch explicitly calls out that it is
preparation for *future* parallelism. Which is why it's *last* in the
series (well, apart from the SEV-ES bit, which I wanted to be even
*more* last-in-the-series than that).

If it was needed for the "phase 1" parallelism of INIT/SIPI/SIPI that
gets enabled in patch #6, then it would have come before that in the
series.

It is necessary — but not sufficient, as you correctly point out — for
the *next* stages of parallelism.

> and so do some of the later (threaded) AP callbacks too.
>
> As I said before 5 out of 10 callbacks I looked at are not ready for
> this.
>
> Just because it did not explode in your face yet, does not make any of
> your wet dreams more correct.

Now you're looking even further into the future. We're not trying to
make the AP startup threads run in parallel yet.

> Again: I'm not interested in this kind of "works for me" nonsense at
> all.

You keep saying that. It's starting to become offensive.

Yes, the first RFC posting of this approach was quite explicitly billed
as 'proof-of-concept hack' and 'don't look too hard because it's only
really for figuring out the timing'. I make no apology for that:
https://lore.kernel.org/lkml/8ac72d7b287ed1058b2dec3301578238aff0abdd.camel@xxxxxxxxxxxxx/

And if you look back at the first actual series that was posted, it's
also very clear about which bit is actually ready because all the
testing and thinking has been done, and which isn't:
https://lore.kernel.org/lkml/478c56af540eaa47b8f452e51cb0c085f26db738.camel@xxxxxxxxxxxxx/

So please, stop giving me this "\"works for me\" nonsense" nonsense.

Yes, there are some patches on top of this lot that aren't ready yet;
we've fixed up *some* of the things that actually broke, and only done
a cursory pass of actually inspecting it to make sure it's safe. And
some of your observations above are very relevant and helpful to when
we come to do *that* part of the thinking.

But we aren't asking you to merge those parts yet. We have deliberately
cut it back to the initial stage because that's where the biggest win
is, and there's *enough* thinking to be done just for this part.

We will review your comments and be happy to engage in further robust
discussion about the future patches when we get round to that (if
indeed we conclude that there's a significant win to be had from them).

In the meantime, if there are specific things which are wrong with the
parallelism introduced by *this* patch series that you're trying to
point out to us, then I apologise; I've missed them.

Attachment: smime.p7s
Description: S/MIME cryptographic signature