[patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

From: Thomas Gleixner

Date: Tue Apr 07 2026 - 04:54:37 EST


Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space:

https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@xxxxxxxxxxxxx/

He provided a reproducer, which sets up a timerfd based timer and then
rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in an endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

The first patch in the series changes the clockevent set next event
mechanism to prevent reprogramming of the clockevent device when the
minimum delta value was programmed unless the new delta is larger than
that. It's a less convoluted variant of the patch which was posted in the
above linked thread and was confirmed to prevent the starvation problem.

But that's only to be considered the last resort because it results in an
insane amount of avoidable hrtimer interrupts.
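
The reprogramming guard of the first patch can be modelled in user space
roughly as follows. This is a sketch of the described behaviour, not the
code in kernel/time/clockevents.c. The struct, its fields and both function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user space model of a clock event device */
struct model_clockevent {
	uint64_t min_delta_ns;		/* smallest programmable delta */
	bool	 min_delta_armed;	/* minimum delta programmed, not yet fired */
	unsigned hw_writes;		/* number of modelled hardware writes */
};

/* Returns true if the modelled hardware was (re)programmed */
static bool model_program_event(struct model_clockevent *dev, uint64_t delta_ns)
{
	assert(dev);

	/* Clamp to the minimum the device supports */
	if (delta_ns < dev->min_delta_ns)
		delta_ns = dev->min_delta_ns;

	/*
	 * The minimum delta is already pending and the new delta is not
	 * larger: leave the device alone so the interrupt can fire.
	 */
	if (dev->min_delta_armed && delta_ns <= dev->min_delta_ns)
		return false;

	dev->hw_writes++;
	dev->min_delta_armed = (delta_ns == dev->min_delta_ns);
	return true;
}

/* The modelled interrupt fired, so nothing is pending anymore */
static void model_interrupt(struct model_clockevent *dev)
{
	assert(dev);
	dev->min_delta_armed = false;
}
```

In this model a rearm loop with expired timers causes one hardware write per
interrupt instead of pushing the expiry out forever, which is also why the
interrupt rate stays high.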

The problem with user controlled timers is that the input value is only
sanity checked for validity of the provided timespec and clamped to the
maximum allowable range. To keep in-kernel usage fast, there is no check
whether a timer which is about to be armed has already expired at enqueue
time.

The rest of the series addresses this by providing a separate interface to
arm user controlled timers. This works the same way as the existing
hrtimer_start_range_ns(), but when the timer ends up as the first timer in
the clock base after enqueue, it performs additional checks:

  - Whether the timer becomes the first expiring timer in the CPU base.

    If not the timer is considered to expire in the future as there is
    already an earlier event programmed.

  - Whether the timer has expired already by comparing the expiry value
    against current time.

    If it is expired, the timer is removed from the clock base and the
    function returns false, so that the caller can handle it. That's
    required because the function cannot invoke the callback as that
    might need to acquire a lock which is held by the caller.
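
The two checks can be illustrated with a small user space model. This is a
sketch of the described semantics only. It is not the interface added by the
series, and all struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per CPU timer state */
struct model_cpu_base {
	uint64_t now_ns;		/* current time of the modelled clock */
	uint64_t next_expiry_ns;	/* earliest programmed event, UINT64_MAX if none */
};

struct model_timer {
	uint64_t expires_ns;
	bool	 enqueued;
};

/*
 * Returns true if the timer is armed. Returns false if it had already
 * expired and was removed again, so the caller has to handle the expiry.
 */
static bool model_start_user_timer(struct model_cpu_base *base,
				   struct model_timer *timer,
				   uint64_t expires_ns)
{
	assert(base && timer);

	timer->expires_ns = expires_ns;
	timer->enqueued = true;

	/* Check 1: an earlier event is already programmed, nothing to do */
	if (expires_ns >= base->next_expiry_ns)
		return true;

	/* Check 2: first expiring timer, but the expiry is in the past */
	if (expires_ns <= base->now_ns) {
		timer->enqueued = false;	/* remove it, do not invoke the callback */
		return false;
	}

	/* First expiring timer in the future: reprogram the event */
	base->next_expiry_ns = expires_ns;
	return true;
}
```

A caller in this model treats a false return like an expiry and runs its own
expiry handling under the locks it already holds, instead of waiting for the
callback.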

This function is then used for the user controlled timer arming interfaces
mainly by converting hrtimer sleeper over to it. That affects a few
in-kernel users too, but the overhead is minimal in that case and it spares
a tedious whack-a-mole game all over the tree.

The other usage sites in posixtimers, alarmtimers and timerfd are converted
as well, which should cover the vast majority of user space controllable
timers as far as my investigation goes.

The series applies against Linux tree and is also available from git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hrtimer-exp-v1

There needs to be some discussion about the scope of backporting. The first
patch preventing the stall is obviously a backport candidate. The rest of
the series is open to debate, but in my opinion it should be backported as
well because it prevents stupid or malicious user space from generating
tons of pointless timer interrupts.

Thanks,

tglx
---
 drivers/power/supply/charger-manager.c |  12 +-
 fs/timerfd.c                           | 115 +++++++++++++++-----------
 include/linux/alarmtimer.h             |   9 +-
 include/linux/clockchips.h             |   2
 include/linux/hrtimer.h                |  20 +++-
 include/trace/events/timer.h           |  13 +++
 kernel/time/alarmtimer.c               |  70 +++++++---------
 kernel/time/clockevents.c              |  23 +++--
 kernel/time/hrtimer.c                  | 142 +++++++++++++++++++++++++++++----
 kernel/time/posix-cpu-timers.c         |  18 ++--
 kernel/time/posix-timers.c             |  35 +++++---
 kernel/time/posix-timers.h             |   4
 kernel/time/tick-common.c              |   1
 kernel/time/tick-sched.c               |   1
 net/netfilter/xt_IDLETIMER.c           |  24 ++++-
 15 files changed, 341 insertions(+), 148 deletions(-)