[patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

From: Thomas Gleixner

Date: Tue Feb 24 2026 - 11:35:22 EST


Peter recently posted a series tweaking the hrtimer subsystem to reduce the
overhead of the scheduler hrtick timer so it can be enabled by default:

https://lore.kernel.org/20260121162010.647043073@xxxxxxxxxxxxx

That turned out to be incomplete and led to a deeper investigation of the
related bits and pieces.

The problem is that the hrtick deadline changes on every context switch and
is also modified by wakeups and balancing. On a hackbench run this results
in about 2500 clockevent reprogramming cycles per second, which is
especially costly in a VM as accessing the clockevent device implies a
VM-Exit.

The following series addresses various aspects of the related problem
space:

1) Scheduler

Aside from some trivial fixes, the handling of the hrtick timer in
the scheduler is suboptimal:

- schedule() modifies the hrtick timer when picking the next task

- schedule() can modify the hrtick timer again when the balance
callback runs before releasing rq::lock

- the expiry time is unfiltered, so updates can result in really
tiny changes of the expiry time, which are functionally
completely irrelevant

Solve this by deferring the hrtick update to the end of schedule()
and filtering out tiny changes.


2) Clocksource, clockevents, timekeeping

- Reading the current clocksource involves an indirect call, which
is expensive, especially for clocksources where the actual read is
a single instruction, like the TSC read on x86.

This could be solved with a static call, but the architecture
coverage for static calls is meager and that still has the
overhead of a function call and in the worst case a return
speculation mitigation.

As x86 and other architectures like S390 have one preferred
clocksource which is normally used on all contemporary systems,
this begs for a fully inlined solution.

This is achieved by a config option which tells the core code to
use the architecture provided inline guarded by a static branch.

If the branch is disabled, the indirect function call is used as
before. If enabled, the inlined read is used.

The branch is disabled by default and only enabled after a
clocksource is installed which has the INLINE feature flag
set. When the clocksource is replaced the branch is disabled
before the clocksource change happens.


- Programming clock events is based on calculating a relative
expiry time, converting it to the clock cycles corresponding to
the clockevent device frequency and invoking the set_next_event()
callback of the clockevent device.

That works perfectly fine as most hardware timers are count down
implementations which require a relative time for programming.

But clockevent devices which are coupled to the clocksource and
provide a less-than-or-equal comparator suffer from this scheme. The
core calculates the relative expiry time based on a clock read
and the set_next_event() callback has to read the same clock
again to convert it back to an absolute time which can be
programmed into the comparator.

The other issue is that the conversion factor of the clockevent
device is calculated at boot time and does not take the NTP/PTP
adjustments of the clocksource frequency into account. Depending
on the direction of the adjustment this can cause timers to fire
early or late. Early is the more problematic case: the timer
interrupt cannot expire timers early, so it has to reprogram the
device with a very short delta.

This can be optimized by introducing a 'coupled' mode for the
clocksource and the clockevent device.

A) If the clocksource indicates support for 'coupled' mode, the
timekeeping core calculates an (NTP adjusted) reverse of the
clocksource-to-nanoseconds conversion factor. This takes NTP
adjustments into account and keeps the two conversions in
sync.

B) The timekeeping core provides a function to convert an
absolute CLOCK_MONOTONIC expiry time into an absolute time in
clocksource cycles which can be programmed directly into the
comparator without reading the clocksource at all.

This is possible because timekeeping keeps a time pair of
the base cycle count and the corresponding CLOCK_MONOTONIC base
time at the last update of the timekeeper.

So the absolute cycle time can be calculated by calculating
the relative time to the CLOCK_MONOTONIC base time,
converting the delta into cycles with the help of #A and
adding the base cycle count. Pure math, no hardware access.

C) The clockevent reprogramming code invokes this conversion
function when the clockevent device indicates 'coupled'
mode. The function returns false when the corresponding
clocksource is not the current system clocksource (based on
a clocksource ID check) and true if the clocksource matches
and the conversion is successful.

If false, the regular relative set_next_event() mechanism is
used, otherwise a new set_next_coupled() callback which
takes the calculated absolute expiry time as argument.

Similar to the clocksource, this new callback can optionally
be inlined.


3) hrtimers

It turned out that the hrtimer code needed a long overdue spring
cleaning independent of the problem at hand. That was conducted
before tackling the actual performance issues:

- Timer locality

The handling of timer locality is suboptimal and often results in
pointless invocations of switch_hrtimer_base() which end up
keeping the CPU base unchanged.

Aside from the pointless overhead, this prevents further
optimizations for the common local case.

Address this by improving the decision logic for keeping the clock
base local and splitting out the (re)arm handling into a unified
operation.


- Evaluation of the clock base expiries

The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
expiring timer, but not the corresponding expiry time, which means
a re-evaluation of the clock bases for the next expiring timer on
the CPU requires touching up to four extra cache lines.

Trivial to solve by caching the earliest expiry time in the clock
base itself.


- Reprogramming of the clock event device

The hrtimer interrupt already defers reprogramming until the
interrupt handler completes, but in case of the hrtick timer
that's not sufficient because the hrtick timer callback only sets
the NEED_RESCHED flag but has no information about the next hrtick
timer expiry time, which can only be determined in the scheduler.

Expand the deferred reprogramming so it can ideally be handled in
the subsequent schedule() after the new hrtick value has been
established. If there is no schedule(), if soft interrupts have to
be processed on return from interrupt, or if a nested interrupt
hits before schedule() is reached, the deferred reprogramming is
handled in those contexts instead.


- Modification of queued timers

If a timer is already queued, modifying the expiry time requires
dequeueing from the RB tree and requeueing after the new expiry
value has been updated. It turned out that the hrtick timer
modifications very often end up at the same spot in the RB tree as
they were before, which means the dequeue/enqueue cycle along
with the related rebalancing could have been avoided. The timer
wheel timers have a similar mechanism: they check upfront whether
the resulting expiry time keeps them in the same hash bucket.

An attempt was made to check this with rb_prev() and rb_next() to
evaluate whether the modification keeps the timer in the same
spot, but that turned out to be really inefficient.

Solve this by providing a RB tree variant which extends the node
with links to the previous and next nodes. These links are
established when the node is linked into the tree and adjusted
when a node is removed. They allow a quick peek at the previous
and next expiry time, and if the new expiry stays within those
boundaries the whole RB tree operation can be avoided.

This also simplifies the caching and update of the leftmost node,
as on removal the rb_next() walk can be completely avoided. The
same scheme could obviously provide a cached rightmost pointer
too, but there is no use case for that (yet).

On a hackbench run this results in about 35% of the updates being
handled that way, which cuts the execution time of
hrtimer_start_range_ns() down to 50ns on a 2GHz machine.


- Cancellation of queued timers

Cancelling a timer or moving its expiry time past the programmed
time can result in reprogramming the clock event device.
Especially with frequent modifications of a queued timer this
results in substantial overhead especially in VMs.

Provide an option for hrtimers to tell the core to handle
reprogramming lazily in those cases, which trades frequent
reprogramming against an occasional pointless hrtimer interrupt.

For the hrtick timer this turned out to be a reasonable tradeoff.
It's especially valuable when transitioning to idle, where the
timer has to be cancelled, but the NOHZ idle code will reprogram
the device for a long idle sleep anyway. It also turned out to be
beneficial in high frequency scheduling scenarios.


With all the above modifications in place, enabling hrtick no longer
results in regressions compared to the hrtick disabled mode.

The reprogramming frequency of the clockevent device went down from
~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
interrupt ratio of about 25%.

What's interesting is the astonishing improvement of a hackbench run with
the following command line parameters: '-l$LOOPS -p -s8'. That uses pipes
with a message size of 8 bytes. On a 112 CPU SKL machine this results in:

              NO HRTICK[_DL]   HRTICK[_DL]
runtime:          0.840s          0.481s      ~ -42%

With other message sizes up to 256, HRTICK still results in improvements,
but not of that magnitude. I haven't investigated the cause of that yet.

While quite a few parts of the series are independent enhancements, I've
decided to keep them together in one big pile for now, as all of the
components are required to actually achieve the overall goal.

The patches have been already structured in a way that they can be
distributed to different subsystem branches without causing major cross
subsystem contamination or merge conflict headaches.

The series applies on v7.0-rc1 and is also available from git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

Thanks,

tglx
---
arch/x86/Kconfig | 2
arch/x86/include/asm/clock_inlined.h | 22
arch/x86/kernel/apic/apic.c | 41 -
arch/x86/kernel/tsc.c | 4
include/asm-generic/thread_info_tif.h | 5
include/linux/clockchips.h | 8
include/linux/clocksource.h | 3
include/linux/hrtimer.h | 59 -
include/linux/hrtimer_defs.h | 79 +-
include/linux/hrtimer_rearm.h | 83 ++
include/linux/hrtimer_types.h | 19
include/linux/irq-entry-common.h | 25
include/linux/rbtree.h | 81 ++
include/linux/rbtree_types.h | 16
include/linux/rseq_entry.h | 14
include/linux/timekeeper_internal.h | 8
include/linux/timerqueue.h | 56 +
include/linux/timerqueue_types.h | 15
include/trace/events/timer.h | 35 -
kernel/entry/common.c | 4
kernel/sched/core.c | 89 ++
kernel/sched/deadline.c | 2
kernel/sched/fair.c | 55 -
kernel/sched/features.h | 5
kernel/sched/sched.h | 41 -
kernel/softirq.c | 15
kernel/time/Kconfig | 16
kernel/time/clockevents.c | 48 +
kernel/time/hrtimer.c | 1116 +++++++++++++++++++---------------
kernel/time/tick-broadcast-hrtimer.c | 1
kernel/time/tick-sched.c | 27
kernel/time/timekeeping.c | 184 +++++
kernel/time/timekeeping.h | 2
kernel/time/timer_list.c | 12
lib/rbtree.c | 17
lib/timerqueue.c | 14
36 files changed, 1497 insertions(+), 728 deletions(-)