Re: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

From: Frederic Weisbecker
Date: Mon Jul 25 2022 - 06:44:06 EST


On Mon, Jul 25, 2022 at 10:32:42AM +0100, Valentin Schneider wrote:
> Hi,
>
> I've been incidentally staring at some NOHZ bits related to the timer
> wheels, and trigger_dyntick_cpu() confuses me:
>
> static void
> trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> {
> [...]
> /*
> * TODO: This wants some optimizing similar to the code below, but we
> * will do that when we switch from push to pull for deferrable timers.
> */
> if (timer->flags & TIMER_DEFERRABLE) {
> if (tick_nohz_full_cpu(base->cpu))
> wake_up_nohz_cpu(base->cpu);
> return;
> }
> [...]
> }
>
> From what I grok out of get_nohz_timer_target(), under
> timers_migration_enabled we should migrate the timer to a non-idle CPU
> (or at the very least a non-isolated CPU) *before* enqueuing the
> timer.

That's not always the case. For example, TIMER_PINNED timers may have
to run on a busy or isolated CPU.

And note that even when (base->cpu == smp_processor_id()) we want to kick
the current CPU with a self-IPI. This way we force the tick, from the
IRQ tail, to recalculate the next deadline to fire, taking the newly
enqueued timer callback into account.

> Without timers_migration_enabled (or if TIMER_PINNED), I don't see
> anything that could migrate the timer elsewhere, so:
>
> Why bother kicking a NOHZ CPU for a deferrable timer if it is the next
> expiring one? Per the definition:
>
> * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
> * system is busy, but will not cause a CPU to come out of idle just
> * to service it; instead, the timer will be serviced when the CPU
> * eventually wakes up with a subsequent non-deferrable timer.
>
> I tried to find some discussion over this in LKML, but found nothing.
> v3 of the patch did *not* kick a CPU for a deferrable timer, but v4 (the
> one that ended up merged) did (see below). Patch in question is:
>
> a683f390b93f ("timers: Forward the wheel clock whenever possible")

Because TIMER_DEFERRABLE timers should only be deferred when the CPU is
in "nohz-idle". If the CPU runs an actual task with the tick shut down
("nohz-full"), we should execute those deferrable timers.

Now that's the theory. In practice, deferrable timers are ignored by
both nohz-idle and nohz-full when computing the next nohz delta. This
mistake has been there since the introduction of nohz-full, but I've
always been scared of breaking some user setup by fixing it. Anyway,
things should look like this (untested):

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index de192dcff828..5f8ef777a785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -819,7 +819,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* disabled this also looks at the next expiring
* hrtimer.
*/
- next_tick = get_next_timer_interrupt(basejiff, basemono);
+ next_tick = get_next_timer_interrupt(basejiff, basemono, ts->inidle);
ts->next_timer = next_tick;
}

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 717fcb9fb14a..8279d4e9b7a0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -574,16 +574,6 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
if (!is_timers_nohz_active())
return;

- /*
- * TODO: This wants some optimizing similar to the code below, but we
- * will do that when we switch from push to pull for deferrable timers.
- */
- if (timer->flags & TIMER_DEFERRABLE) {
- if (tick_nohz_full_cpu(base->cpu))
- wake_up_nohz_cpu(base->cpu);
- return;
- }
-
/*
* We might have to IPI the remote CPU if the base is idle and the
* timer is not deferrable. If the other CPU is on the way to idle
@@ -1678,17 +1668,9 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
}

-/**
- * get_next_timer_interrupt - return the time (clock mono) of the next timer
- * @basej: base time jiffies
- * @basem: base time clock monotonic
- *
- * Returns the tick aligned clock monotonic time of the next pending
- * timer or KTIME_MAX if no timer is pending.
- */
-u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
+static u64 get_next_base_interrupt(struct timer_base *base,
+ unsigned long basej, u64 basem)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
u64 expires = KTIME_MAX;
unsigned long nextevt;

@@ -1734,6 +1716,32 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
}
raw_spin_unlock(&base->lock);

+ return expires;
+}
+
+/**
+ * get_next_timer_interrupt - return the time (clock mono) of the next timer
+ * @basej: base time jiffies
+ * @basem: base time clock monotonic
+ * @idle: is the CPU idle?
+ *
+ * Returns the tick aligned clock monotonic time of the next pending
+ * timer or KTIME_MAX if no timer is pending.
+ */
+u64 get_next_timer_interrupt(unsigned long basej, u64 basem, bool idle)
+{
+ struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ u64 expires;
+
+ expires = get_next_base_interrupt(base, basej, basem);
+ if (!idle) {
+ u64 expires_def;
+
+ base = this_cpu_ptr(&timer_bases[BASE_DEF]);
+ expires_def = get_next_base_interrupt(base, basej, basem);
+ expires = min(expires, expires_def);
+ }
+
return cmp_next_hrtimer_event(basem, expires);
}

@@ -1744,15 +1752,15 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
void timer_clear_idle(void)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
/*
* We do this unlocked. The worst outcome is a remote enqueue sending
* a pointless IPI, but taking the lock would just make the window for
* sending the IPI a few instructions smaller for the cost of taking
* the lock in the exit from idle path.
*/
- base->is_idle = false;
+ __this_cpu_write(timer_bases[BASE_STD].is_idle, false);
+ if (tick_nohz_full_cpu(smp_processor_id()))
+ __this_cpu_write(timer_bases[BASE_DEF].is_idle, false);
}
#endif