Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Sebastian Siewior
Date: Wed Dec 06 2023 - 11:35:54 EST


On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:

> As long as a CPU is busy it expires both local and global timers. When a
> CPU goes idle it arms for the first expiring local timer. If the first
> expiring pinned (local) timer is before the first expiring movable timer,
> then no action is required because the CPU will wake up before the first
> movable timer expires. If the first expiring movable timer is before the
> first expiring pinned (local) timer, then this timer is queued into a idle
s/a idle/an idle/
> timerqueue and eventually expired by some other active CPU.
s/some other/another ?
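
Just to make sure I read the idle path correctly, in pseudo code (the
names are made up, not taken from the patch):

        /* CPU goes idle */
        if (first_pinned_expiry < first_movable_expiry) {
                /* Nothing to do, the CPU wakes up for the pinned timer
                 * before the movable timer expires anyway. */
        } else {
                /* Queue the movable timer into the idle timerqueue so
                 * that another active CPU expires it. */
        }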


>
> Signed-off-by: Anna-Maria Behnsen <anna-maria@xxxxxxxxxxxxx>
> ---
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index b6c9ac0c3712..ac3e888d053f 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -2103,6 +2104,64 @@ void timer_lock_remote_bases(unsigned int cpu)

> +static void timer_use_tmigr(unsigned long basej, u64 basem,
> +                            unsigned long *nextevt, bool *tick_stop_path,
> +                            bool timer_base_idle, struct timer_events *tevt)
> +{
> +        u64 next_tmigr;
> +
> +        if (timer_base_idle)
> +                next_tmigr = tmigr_cpu_new_timer(tevt->global);
> +        else if (tick_stop_path)
> +                next_tmigr = tmigr_cpu_deactivate(tevt->global);
> +        else
> +                next_tmigr = tmigr_quick_check();
> +
> +        /*
> +         * If the CPU is the last going idle in timer migration hierarchy, make
> +         * sure the CPU will wake up in time to handle remote timers.
> +         * next_tmigr == KTIME_MAX if other CPUs are still active.
> +         */
> +        if (next_tmigr < tevt->local) {
> +                u64 tmp;
> +
> +                /* If we missed a tick already, force 0 delta */
> +                if (next_tmigr < basem)
> +                        next_tmigr = basem;
> +
> +                tmp = div_u64(next_tmigr - basem, TICK_NSEC);

Is this considered a hot path? Asking because u64 divs are nice if they
can be avoided ;)

I guess the original value comes from fetch_next_timer_interrupt(). But
then you only need the jiffies based value if the caller
(__get_next_timer_interrupt()) has the `idle' pointer set. Otherwise the
division is pointless.
Would it somehow work to replace

        base_local->is_idle = time_after(nextevt, basej + 1);

with maybe something like

        base_local->is_idle = tevt.local > basem + TICK_NSEC;

If so, you could avoid the `nextevt' maneuver.
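
Roughly what I have in mind, completely untested and only meant as a
sketch of the idea (whether base_global wants the same treatment is a
guess on my side):

        /* Idle decision based on the ktime values, no jiffies conversion */
        base_local->is_idle = tevt.local > basem + TICK_NSEC;
        base_global->is_idle = base_local->is_idle;

        if (idle) {
                /*
                 * Only the tick stop path needs the jiffies based nextevt,
                 * so the div_u64() could move in here.
                 */
                ...
        }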

> +                *nextevt = basej + (unsigned long)tmp;
> +                tevt->local = next_tmigr;
> +        }
> +}
> +# else

> @@ -2132,6 +2190,21 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>         nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
>                                              base_global, &tevt);
>
> +        /*
> +         * When the when the next event is only one jiffie ahead there is no

If the next event is only one jiffy ahead then there is no

> +         * need to call timer migration hierarchy related
> +         * functions. @tevt->global will be KTIME_MAX, nevertheless if the next
> +         * timer is a global timer. This is also true, when the timer base is

The second sentence is hard to parse.

> +         * idle.
> +         *
> +         * The proper timer migration hierarchy function depends on the callsite
> +         * and whether timer base is idle or not. @nextevt will be updated when
> +         * this CPU needs to handle the first timer migration hierarchy event.
> +         */
> +        if (time_after(nextevt, basej + 1))
> +                timer_use_tmigr(basej, basem, &nextevt, idle,
> +                                base_local->is_idle, &tevt);
> +
>         /*
>          * We have a fresh next event. Check whether we can forward the
>          * base.
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@

> +/*
> + * The timer migration mechanism is built on a hierarchy of groups. The
> + * lowest level group contains CPUs, the next level groups of CPU groups
> + * and so forth. The CPU groups are kept per node so for the normal case
> + * lock contention won't happen across nodes. Depending on the number of
> + * CPUs per node even the next level might be kept as groups of CPU groups
> + * per node and only the levels above cross the node topology.
> + *
> + * Example topology for a two node system with 24 CPUs each.
> + *
> + *                         LVL 2       [GRP2:0]
> + *                                GRP1:0 = GRP1:M
> + *
> + *           LVL 1    [GRP1:0]                      [GRP1:1]
> + *                GRP0:0 - GRP0:2               GRP0:3 - GRP0:5
> + *
> + *    LVL 0  [GRP0:0]  [GRP0:1]  [GRP0:2]  [GRP0:3]  [GRP0:4]  [GRP0:5]
> + *    CPUS     0-7       8-15     16-23     24-31     32-39     40-47

In the CPUS list there is a tab between 24-31 and 32-39 while the other
separators are spaces. Could you please align it with spaces? Judging
from the top you have tabstop=8, but here only tabstop=4 makes it look
"nice".

> + *
> + * The groups hold a timer queue of events sorted by expiry time. These
> + * queues are updated when CPUs go in idle. When they come out of idle
> + * ignore flag of events is set.
> + *
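
Just to check that I read the above correctly: each group then carries
roughly this state (made up names, not the actual structs from the
patch):

        struct group_sketch {
                raw_spinlock_t          lock;    /* protects the queue below */
                struct timerqueue_head  events;  /* first global event of each
                                                  * idle child, expiry sorted */
                struct group_sketch     *parent; /* NULL at the top level */
        };

        struct group_event_sketch {
                struct timerqueue_node  nextevt;
                bool                    ignore;  /* set when the child comes
                                                  * out of idle again */
        };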

Sebastian