Re: [PATCH] timers/migration: Fix livelock in tmigr_handle_remote_up()
From: Frederic Weisbecker
Date: Thu Jun 04 2026 - 05:55:32 EST
On Wed, Jun 03, 2026 at 05:01:39PM +0000, Amit Matityahu wrote:
> tmigr_handle_remote_cpu() skips timer_expire_remote() when cpu ==
> smp_processor_id(), assuming the local softirq path already handled
> this CPU's timers.
>
> This assumption breaks when jiffies advances between
> run_timer_base(BASE_GLOBAL) and tmigr_handle_remote() in the same
> softirq invocation - a timer expires after the wheel ran but before
> the hierarchy snapshot is taken.
>
> The stranded timer is never collected,
> fetch_next_timer_interrupt_remote() keeps reporting it as expired,
> and the event is re-queued with expires == now on each iteration.
> The goto-again loop spins indefinitely.
>
> Fix by calling timer_expire_remote() unconditionally.
> __run_timer_base() already returns early when there is nothing to
> expire, making this a no-op in the common case.
>
> Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
> Cc: stable@xxxxxxxxxxxxxxx
> Reported-by: Alon Kariv <alonka@xxxxxxxxxx>
> Cc: Jonathan Chocron <jonnyc@xxxxxxxxxx>
> Cc: Akram Baransi <abaransi@xxxxxxxxxx>
> Cc: David Woodhouse <dwmw@xxxxxxxxxxxx>
> Signed-off-by: Amit Matityahu <amitmat@xxxxxxxxxx>
That's quite serious indeed!
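
For anyone following along, the race described above can be modelled in a
few lines of userspace C. This is only a rough sketch under loud
assumptions: every toy_* name, the single-timer CPU and the pass cap are
made up for illustration and are not the kernel's code. The local wheel
runs at jiffies == 100, the timer is due at 101, and the hierarchy is
walked at now == 101.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, NOT kernel code: one CPU holding a single global timer. */
struct toy_cpu {
	uint64_t timer_expires;	/* expiry of the one pending global timer */
	bool timer_pending;
};

/* Hypothetical stand-in for expiring a CPU's global timers at 'now'. */
static void toy_expire(struct toy_cpu *c, uint64_t now)
{
	if (c->timer_pending && c->timer_expires <= now)
		c->timer_pending = false;
}

/* Hypothetical stand-in for "the hierarchy still reports an expired event". */
static bool toy_event_expired(const struct toy_cpu *c, uint64_t now)
{
	return c->timer_pending && c->timer_expires <= now;
}

/*
 * Models the goto-again loop. Returns the number of passes taken,
 * capped at max_passes so the model itself terminates.
 */
static unsigned int toy_handle_remote(struct toy_cpu *c, uint64_t now,
				      bool is_local, bool skip_local,
				      unsigned int max_passes)
{
	unsigned int passes = 0;

	while (toy_event_expired(c, now) && passes < max_passes) {
		passes++;
		/* Old behaviour: skip expiry when the event is for this CPU */
		if (!(skip_local && is_local))
			toy_expire(c, now);
	}
	return passes;
}

/* The scenario from the changelog: jiffies advances between the two steps. */
static unsigned int toy_scenario(bool skip_local, unsigned int max_passes)
{
	struct toy_cpu cpu = { .timer_expires = 101, .timer_pending = true };

	toy_expire(&cpu, 100);	/* local wheel runs first: timer not due yet */
	return toy_handle_remote(&cpu, 101, true, skip_local, max_passes);
}
```

With the local-CPU skip in place, toy_scenario(true, N) only stops because
of the cap. Without the skip, the stranded timer is collected on the first
pass and the loop ends.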
> ---
>
> Questions for maintainers:
>
> 1. What was the original rationale for the cpu != smp_processor_id()
> check? There is no code comment, commit message explanation or anything
> in the original patch's email discussion as to why
> timer_expire_remote() is skipped for the local CPU.
The rationale was the assumption that an expired timerqueue event for the
local CPU reflected a timer that had already been handled locally, so it
could be safely discarded and we could spare some locking.
>
> 2. There seems to be a design tension where a CPU can have timers
> visible in the migration hierarchy while simultaneously running its
> own local softirq. Is the expectation that run_timer_base() always
> drains everything before tmigr_handle_remote() sees it, or should
> the remote path handle local-CPU timers as a fallback?
It's not easy to defer all global timer handling to remote expiration,
because the current CPU may or may not be the migrator.
>
> kernel/time/timer_migration.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 1d0d3a4058d5..298c34c942ae 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -978,8 +978,7 @@ static void tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
>  	/* Drop the lock to allow the remote CPU to exit idle */
>  	raw_spin_unlock_irq(&tmc->lock);
>  
> -	if (cpu != smp_processor_id())
> -		timer_expire_remote(cpu);
> +	timer_expire_remote(cpu);
Reviewed-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Thanks!
>
>  	/*
>  	 * Lock ordering needs to be preserved - timer_base locks before tmigr
>
> base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
> --
> 2.47.3
>
--
Frederic Weisbecker
SUSE Labs