Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

From: Nicholas Piggin
Date: Wed Apr 10 2024 - 00:27:41 EST


On Tue Apr 9, 2024 at 11:07 PM AEST, Oleg Nesterov wrote:
> On 04/09, Frederic Weisbecker wrote:
> >
> > Le Sun, Apr 07, 2024 at 03:09:14PM +0200, Oleg Nesterov a écrit :
> > > Well, the changelog says
> > >
> > > nohz_full has been trialed at a large supercomputer site and found to
> > > significantly reduce jitter. In order to deploy it in production, they
> > > need CPU0 to be nohz_full
> > >
> > > so I guess this feature has users.

It was the Summit/Sierra supercomputers which I suppose are still using
it. IIRC it was an existing job scheduler system they had which ran
housekeeping work on the highest numbered core in a socket and allocated
jobs from lowest number. We certainly asked if they could change that,
but apparently that was difficult. I'm surprised nobody ran into it on
x86 though. Maybe the system had more jitter (SMT4 doesn't help), so
maybe it wasn't needed to use isolcpus= in other cases.

The other thing is powerpc can boot on arbitrary CPU number. So if boot
CPU must be in the mask then it could randomly break your boot
config if boot CPU must be in HK mask.

> > >
> > > But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> > > timers on queue_delayed_work") the kernel will crash at boot time if the boot
> > > CPU is nohz_full.
> >
> > Right but there are many possible calls to housekeeping on boot before the
> > first housekeeper becomes online.
>
> Well, it seems that other callers are more or less fine in this respect...
> At least the kernel boots fine with that commit reverted.
>
> But a) I didn't try to really check, and b) this doesn't matter.
>
> I agree, and that is why I never blamed this change in queue_delayed_work().
>
> OK, you seem to agree with the patch below, I'll write the changelog/comment
> and send it "officially".
>
> Or did I misunderstand you?

Thanks for this. Taking a while to page this back in, the intention is
for housekeeping to be done by boot CPU until house keeper is awake, so
returning smp_processor_id() seems like the right thing to do here for
ephemeral jobs like timers and work, provided that CPU / mask is not
stored somewhere long term by the caller.

For things that set an affinity like kthread, sched, maybe managed
irqs, and such.

There are not many callers of housekeeping_any_cpu() so that's easy
enough to verify. But similar like housekeeping_cpumask() and others
could be an issue or at least a foot-gun, I'm not sure how well I
convinced myself of those.

Could you test like this?

WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
type != HK_TYPE_TIMER);

With a comment to say other ephemeral mask types could be exempted if
needed.

It would also be nice to warn for cases that would be bugs if the boot
CPU was not in the HK mask. Could that be done by having a
housekeepers_online() call after smp_init() (maybe at the start of
sched_init_smp()) that could verify there is at least one online, and
set a flag that could be used to create warnings.

Thanks,
Nick

>
> Oleg.
>
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..e912555c6fc8 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> }
> }
> return smp_processor_id();