Re: sched/isolation: tick_take_do_timer_from_boot() calls smp_call_function_single() with irqs disabled

From: Thomas Gleixner
Date: Fri May 24 2024 - 18:06:20 EST


On Fri, May 24 2024 at 20:37, Oleg Nesterov wrote:

> I've already had a few beers today, I know I'll regret about this
> email tomorrow, but I can't resist ;)

You won't regret it. :)

> On 05/24, Frederic Weisbecker wrote:
> But again, again. tick_sched_do_timer() says
>
> * If nohz_full is enabled, this should not happen because the
> * 'tick_do_timer_cpu' CPU never relinquishes.
>
> so I guess it is not supposed to happen?

Right. It does not happen because the kernel starts with jiffies as
clocksource except on S390. The jiffies clocksource is not qualified to
switch over to NOHZ mode for obvious reasons. But even on S390 which has
a truly usable and useful clocksource the tick stays periodic to begin
with. Why?

The NOHZ ready notification happens late in the boot process via:
	fs_initcall(clocksource_done_booting)

So by the time that happens, the secondary CPUs are up and have taken
over the do timer duty.

[ 0.600381] smp: Bringing up secondary CPUs ...

...

[ 1.917842] clocksource: Switched to clocksource kvm-clock
[ 1.918548] clocksource_done_booting: Switched to NOHZ // debug printk

This is the point where tick_nohz_activate() is called for the first
time, and that does:

	tick_sched_flag_set(ts, TS_FLAG_NOHZ);

So up to this point the tick is never stopped, either on housekeeping
or on NOHZ FULL CPUs:

	tick_nohz_full_update_tick()
		if (!tick_sched_flag_test(ts, TS_FLAG_NOHZ))
			return;
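
As a rough userspace model of that gate (not kernel code: the struct,
the model_* helpers and the bit position of the flag are made up for
illustration, only the TS_FLAG_NOHZ name is taken from above), the
early return keeps the tick running until the flag has been set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit value only; not taken from the kernel headers. */
#define TS_FLAG_NOHZ	(1UL << 1)

/* Hypothetical stand-in for the per CPU tick state. */
struct tick_sched_model {
	unsigned long	flags;
	bool		tick_stopped;
};

static bool model_flag_test(const struct tick_sched_model *ts, unsigned long flag)
{
	return (ts->flags & flag) != 0;
}

/* Models what the first tick_nohz_activate() call does to the flags. */
static void model_flag_set(struct tick_sched_model *ts, unsigned long flag)
{
	ts->flags |= flag;
}

/*
 * Models the early return quoted above: as long as TS_FLAG_NOHZ is not
 * set, the request to stop the tick is ignored and the tick stays
 * periodic.
 */
static void model_full_update_tick(struct tick_sched_model *ts, bool can_stop)
{
	assert(ts != NULL);	/* precondition of this model */

	if (!model_flag_test(ts, TS_FLAG_NOHZ))
		return;

	ts->tick_stopped = can_stop;
}
```

Calling model_full_update_tick() before model_flag_set() leaves
tick_stopped untouched, which is the state every CPU is in while the
secondaries come up.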

> And. My main question was: how can smp_call_function_single() help???

It's useless.

> Why do we actually need it?

We do not.

As explained above there is also nothing extra to fix, contrary to
Frederic's fears.

That holds even when a command line limitation restricts the number of
CPUs such that no housekeeping CPU is onlined during smp_init(). The
isolation init code checks for that case and clears nohz_full_running.
Nothing to see there either.
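
To illustrate the idea only (a toy model with 64-bit CPU masks; the
function name and the exact condition are assumptions and not a copy of
the isolation init code), the check boils down to "keep nohz_full only
if at least one online CPU is not in the nohz_full set":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: bit N set in @online_mask means CPU N is online,
 * bit N set in @nohz_full_mask means CPU N was requested as nohz_full.
 * Returns the modelled value of nohz_full_running.
 */
static bool model_nohz_full_running(uint64_t online_mask, uint64_t nohz_full_mask)
{
	uint64_t housekeeping_online;

	/* At least the boot CPU is always online in this model. */
	assert(online_mask != 0);

	/* Online CPUs which are not isolated can do the timekeeping. */
	housekeeping_online = online_mask & ~nohz_full_mask;

	/* No request at all, or nobody left to keep time: stay off. */
	if (nohz_full_mask == 0 || housekeeping_online == 0)
		return false;

	return true;
}
```

With CPUs 0-3 online and nohz_full covering 1-3 this returns true. With
only CPU 1 online and the same nohz_full set it returns false, which
corresponds to the restricted command line case described above.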

So all this needs is the simple:

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d88b13076b79..dab17d756fd8 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -229,11 +209,9 @@ static void tick_setup_device(struct tick_device *td,
 			if (tick_nohz_full_cpu(cpu))
 				tick_do_timer_boot_cpu = cpu;
 
-		} else if (tick_do_timer_boot_cpu != -1 &&
-						!tick_nohz_full_cpu(cpu)) {
-			tick_take_do_timer_from_boot();
+		} else if (tick_do_timer_boot_cpu != -1 && !tick_nohz_full_cpu(cpu)) {
+			WRITE_ONCE(tick_do_timer_cpu, cpu);
 			tick_do_timer_boot_cpu = -1;
-			WARN_ON(READ_ONCE(tick_do_timer_cpu) != cpu);
 #endif
 		}

along with the removal of the SMP function call voodoo programming gunk,
a lengthy changelog and a bunch of useful comments.

Changing the horribly lazy and incomprehensible '-1' to an actual
meaningful define, e.g. TICK_DO_TIMER_NONE, would definitely help along
with renaming the variable to tick_do_timer_nohz_full_boot_cpu.
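
Something along these lines, written as a standalone userspace sketch
rather than as a patch (the -1 value for TICK_DO_TIMER_NONE, the
model_take_do_timer() helper and the relaxed atomic store standing in
for WRITE_ONCE() are assumptions made for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed value; replaces the bare '-1' in the condition. */
#define TICK_DO_TIMER_NONE	(-1)

/* The CPU which currently owns the do timer duty. */
static _Atomic int tick_do_timer_cpu = TICK_DO_TIMER_NONE;

/* Renamed from tick_do_timer_boot_cpu as suggested above. */
static int tick_do_timer_nohz_full_boot_cpu = TICK_DO_TIMER_NONE;

/*
 * Models the secondary CPU side of the handover. Returns true when
 * @cpu took over the duty from the nohz_full boot CPU.
 */
static bool model_take_do_timer(int cpu, bool cpu_is_nohz_full)
{
	assert(cpu >= 0);	/* only real CPU numbers make sense here */

	/* Nothing to hand over, or @cpu must not take the duty. */
	if (tick_do_timer_nohz_full_boot_cpu == TICK_DO_TIMER_NONE ||
	    cpu_is_nohz_full)
		return false;

	/* Stand-in for WRITE_ONCE(): a plain store, no cross CPU call. */
	atomic_store_explicit(&tick_do_timer_cpu, cpu, memory_order_relaxed);
	tick_do_timer_nohz_full_boot_cpu = TICK_DO_TIMER_NONE;
	return true;
}
```

The first non nohz_full CPU which gets here takes the duty, every later
caller sees TICK_DO_TIMER_NONE and leaves tick_do_timer_cpu alone.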

There is no race other than the boot CPU reading tick_do_timer_cpu
concurrently with the update, but that's completely harmless whatever it
sees there. Whether it sees the boot CPU, i.e. 0, or the secondary does
not matter. The secondary immediately schedules the tick unconditionally,
so timekeeping and jiffies will just work.

If the secondary CPU fails to come up after it installed the clock event
device then the missing tick is the least of the problems.

That has absolutely nothing to do with the issue at hand. If the CPU
which owns tick_do_timer_cpu dies or gets stuck then all bets are off
independent of NOHZ FULL. See the changes which went in during the merge
window to handle the case where the hypervisor fails to inject the timer
interrupts or keeps the time keeper duty CPU scheduled out for a long
period of time....

Thanks,

tglx