Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

From: David Woodhouse
Date: Wed Dec 08 2021 - 13:32:29 EST


On Wed, 2021-12-08 at 09:35 -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2021 at 04:57:07PM +0000, David Woodhouse wrote:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index ef8d36f580fc..544198c674f2 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4246,11 +4246,11 @@ void rcu_cpu_starting(unsigned int cpu)
> >
> > rnp = rdp->mynode;
> > mask = rdp->grpmask;
> > + raw_spin_lock_irqsave_rcu_node(rnp, flags);
>
> If I am not too confused this morning, this can result in confusing
> lockdep splats because lockdep needs RCU to be watching the CPU
> acquiring the lock. See the rcu_lockdep_current_cpu_online()
> function and its callers, with emphasis on lockdep_rcu_suspicious()
> and rcu_read_lock_held_common().

Hm, OK. And it's the very act of making rnp->ofl_seq odd (setting the
low bit) which triggers that, yes?
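
My mental model of the convention is roughly the following. This is a
standalone sketch of my understanding, not the real tree.c code:
cpu_considered_online() and cpu_starting() are made-up names standing
in for rcu_lockdep_current_cpu_online() and rcu_cpu_starting().

#include <stdbool.h>
#include <stdio.h>

/* even: no transition in flight, odd: a CPU is mid-onlining */
static unsigned long ofl_seq;
/* bit set once the CPU is accounted as fully online */
static unsigned long qsmaskinitnext;

static bool cpu_considered_online(unsigned long mask)
{
	/* "online" if its bit is set, or if a transition is in flight */
	return (qsmaskinitnext & mask) || (ofl_seq & 0x1);
}

static void cpu_starting(unsigned long mask)
{
	ofl_seq++;			/* now odd: treat the CPU as online */
	/* ... lock rnp, set masks, record gp_seq, etc ... */
	qsmaskinitnext |= mask;
	ofl_seq++;			/* back to even: the mask bit covers us now */
}

int main(void)
{
	printf("before: %d\n", cpu_considered_online(0x1));	/* 0 */
	cpu_starting(0x1);
	printf("after:  %d\n", cpu_considered_online(0x1));	/* 1 */
	return 0;
}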

> > WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> > WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
> > rcu_dynticks_eqs_online();
> > smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> > - raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
> > newcpu = !(rnp->expmaskinitnext & mask);
> > rnp->expmaskinitnext |= mask;
> > @@ -4261,6 +4261,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> > rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> >
> > + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
> > + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
> > + WARN_ON_ONCE(rnp->ofl_seq & 0x1);
> > + smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
> > +
> > /* An incoming CPU should never be blocking a grace period. */
> > if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */
> > rcu_disable_urgency_upon_qs(rdp);
> > @@ -4269,10 +4274,6 @@ void rcu_cpu_starting(unsigned int cpu)
> > } else {
> > raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>
> And ditto here upon release.
>
> As a short-term hack, I suggest moving the ->ofl_seq field from the
> rcu_node structure to the rcu_data structure. This will require the loop
> in rcu_gp_init() to wait on each of the current rcu_node structure's CPUs.
> Which is not good from the viewpoint of the RCU grace-period kthread's
> CPU consumption, but it should allow you to make progress on your testing.

Ok, thanks. My initial hack of sticking my own spinlock around the
whole thing was also working for testing, but now I'm trying to clean
it up so I can post something for merging.
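
If I'm reading the short-term suggestion right, the rcu_gp_init() side
would end up looking something like the sketch below. Not a tested
patch: wait_for_node_ofl_seq() is a name I've just invented, and
rdp->ofl_seq assumes the field has hypothetically moved from rcu_node
into rcu_data.

static void wait_for_node_ofl_seq(struct rcu_node *rnp)
{
	struct rcu_data *rdp;
	unsigned long firstseq;
	int cpu;

	for_each_leaf_node_possible_cpu(rnp, cpu) {
		rdp = per_cpu_ptr(&rcu_data, cpu);
		smp_mb(); // Pair with barriers around ->ofl_seq updates.
		firstseq = READ_ONCE(rdp->ofl_seq); /* hypothetical per-CPU field */
		if (firstseq & 0x1)
			while (firstseq == READ_ONCE(rdp->ofl_seq))
				schedule_timeout_idle(1);
	}
}

That is indeed more work for the GP kthread than the single per-node
check it does today, as you say.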

> Though I am having some difficulty remembering why that wait loop in
> rcu_gp_init() needs to be there. I am going to try removing it and
> seeing if rcutorture will be kind enough to remind me. ;-)
>
> And it will of course be necessary to upgrade rcutorture to test
> concurrent CPU-online operations. Will there be some sort of
> start-CPU-online function, or should I instead expect to need to
> provide multiple kthreads for onlining and an additional kthread
> for offlining?

This is just at *boot* time, not runtime hotplug/unplug. We observed
that we spend quite a lot of time on a 96-way 2-socket Skylake system
just sending INIT to each CPU in turn, then waiting for it to be fully
online, then moving on to the next one. Hence doing them all in
parallel, which reduces the AP bringup time from about 300ms to 30ms.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel-5.16
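
Schematically the change is roughly this. Everything in the snippet is
a stand-in, function names included; kick_ap() and wait_for_ap_online()
are not real kernel functions, just shorthand for the INIT/SIPI wakeup
and the cpuhp online wait.

/* Today: each AP's INIT/SIPI and online wait happen back to back. */
static void __init bringup_serial(void)
{
	int cpu;

	for_each_present_cpu(cpu) {
		kick_ap(cpu);			/* send INIT/SIPI */
		wait_for_ap_online(cpu);	/* milliseconds each, all serialized */
	}
}

/* With the series: kick everyone first, so the waits overlap. */
static void __init bringup_parallel(void)
{
	int cpu;

	for_each_present_cpu(cpu)
		kick_ap(cpu);			/* all APs race through startup... */
	for_each_present_cpu(cpu)
		wait_for_ap_online(cpu);	/* ...total wait ~ slowest CPU, not the sum */
}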

> Huh. I take it that concurrent online and offline is future work?
> Or does that need to work initially?
>

Concurrent *online* (at boot) is the whole point. Those last two
commits currently in the branch linked above are the "oh crap, *that*
part doesn't work if you really let it happen concurrently, so let's
serialize them" hacks. In particular, the RCU one is
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/5f4b77c9459c
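
The shape of that hack is basically a global lock serializing the whole
function. Sketch only -- the lock name below is invented, and the real
thing is at the link above:

static DEFINE_RAW_SPINLOCK(rcu_startup_lock);	/* made-up name for this sketch */

void rcu_cpu_starting(unsigned int cpu)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&rcu_startup_lock, flags);
	/* ... existing body: bump ->ofl_seq, set qsmaskinitnext, etc ... */
	raw_spin_unlock_irqrestore(&rcu_startup_lock, flags);
}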

And now I'm trying to come up with something a little less hackish :)

> More to the point, what are you using to stress-test this capability?

Just boot, with lots of CPUs (and vCPUs in qemu). Even with a nice
fast parallel CPU bringup in the guest, Linux then spends the next 16
seconds printing silly pr_info messages about KVM features, so it
isn't the most exciting overall result right now.

I confess I haven't actually tested runtime hotplug/unplug again
recently. I should do that ;)
