Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement
From: Christian Loehle
Date: Wed Mar 04 2026 - 11:18:28 EST
On 2/24/26 16:35, Thomas Gleixner wrote:
> Peter recently posted a series tweaking the hrtimer subsystem to reduce the
> overhead of the scheduler hrtick timer so it can be enabled by default:
>
> https://lore.kernel.org/20260121162010.647043073@xxxxxxxxxxxxx
>
> That turned out to be incomplete and led to a deeper investigation of the
> related bits and pieces.
>
> The problem is that the hrtick deadline changes on every context switch and
> is also modified by wakeups and balancing. On a hackbench run this results
> in about 2500 clockevent reprogramming cycles per second, which is
> especially hurtful in a VM as accessing the clockevent device implies a
> VM-Exit.
>
> The following series addresses various aspects of the overall related
> problem space:
>
> 1) Scheduler
>
> Aside from some trivial fixes, the handling of the hrtick timer in
> the scheduler is suboptimal:
>
> - schedule() modifies the hrtick when picking the next task
>
> - schedule() can modify the hrtick when the balance callback runs
> before releasing rq:lock
>
> - the expiry time is unfiltered and can result in really tiny
> changes of the expiry time, which are functionally completely
> irrelevant
>
> Solve this by deferring the hrtick update to the end of schedule()
> and filtering out tiny changes.
>
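The defer-and-filter idea can be sketched roughly like this (a userspace toy; the HRTICK_SLACK_NS threshold, the variable names, and the logic are my illustrative assumptions, not the actual patch):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch of "filter out tiny changes": only reprogram the hrtick
 * timer at the end of schedule() when the new expiry differs from
 * the currently programmed one by more than a slack threshold.
 * The 10us value is an assumption for illustration.
 */
#define HRTICK_SLACK_NS 10000ULL  /* changes below this are ignored */

static uint64_t programmed_expiry;

static bool hrtick_needs_reprogram(uint64_t new_expiry)
{
    uint64_t delta = new_expiry > programmed_expiry ?
                     new_expiry - programmed_expiry :
                     programmed_expiry - new_expiry;
    /* A functionally irrelevant shift stays on the old deadline. */
    return delta > HRTICK_SLACK_NS;
}
```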
>
> 2) Clocksource, clockevents, timekeeping
>
> - Reading the current clocksource involves an indirect call, which
> is expensive especially for clocksources where the actual read is
> a single instruction like the TSC read on x86.
>
> This could be solved with a static call, but the architecture
> coverage for static calls is meager and that still has the
> overhead of a function call and in the worst case a return
> speculation mitigation.
>
> As x86 and other architectures like S390 have one preferred
> clocksource which is normally used on all contemporary systems,
> this begs for a fully inlined solution.
>
> This is achieved by a config option which tells the core code to
> use the architecture-provided inline guarded by a static branch.
>
> If the branch is disabled, the indirect function call is used as
> before. If enabled, the inlined read is used.
>
> The branch is disabled by default and only enabled after a
> clocksource is installed which has the INLINE feature flag
> set. When the clocksource is replaced the branch is disabled
> before the clocksource change happens.
>
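The enable/disable dance can be pictured with this userspace stand-in (a plain bool replaces the static key and a stub replaces the TSC read; all names here are assumptions, not the series' actual identifiers):

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the static key the core flips when a clocksource
 * with the INLINE feature flag is installed or replaced. */
static bool inline_clock_enabled;

/* Slow path: the indirect call through the clocksource descriptor. */
static uint64_t fallback_read(void) { return 7; }
static uint64_t (*clocksource_read)(void) = fallback_read;

/* Stand-in for the architecture-provided inline, e.g. a TSC read. */
static inline uint64_t arch_inline_clock_read(void) { return 42; }

static uint64_t clock_read(void)
{
    /* In the kernel this would be static_branch_likely(): the
     * branch is patched into the text, so the fast path has no
     * indirect call and no retpoline overhead. */
    if (inline_clock_enabled)
        return arch_inline_clock_read();
    return clocksource_read();
}
```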
>
> - Programming clock events is based on calculating a relative
> expiry time, converting it to the clock cycles corresponding to
> the clockevent device frequency and invoking the set_next_event()
> callback of the clockevent device.
>
> That works perfectly fine as most hardware timers are count down
> implementations which require a relative time for programming.
>
> But clockevent devices which are coupled to the clocksource and
> provide a less-than-or-equal comparator suffer from this scheme. The
> core calculates the relative expiry time based on a clock read
> and the set_next_event() callback has to read the same clock
> again to convert it back to an absolute time which can be
> programmed into the comparator.
>
> The other issue is that the conversion factor of the clockevent
> device is calculated at boot time and does not take the NTP/PTP
> adjustments of the clocksource frequency into account. Depending
> on the direction of the adjustment this can cause timers to fire
> early or late. Early is the more problematic case as the timer
> interrupt has to reprogram the device with a very short delta as
> it can't expire timers early.
>
> This can be optimized by introducing a 'coupled' mode for the
> clocksource and the clockevent device.
>
> A) If the clocksource indicates support for 'coupled' mode, the
> timekeeping core calculates a (NTP adjusted) reverse
> conversion factor from the clocksource to nanoseconds
> conversion. This takes NTP adjustments into account and
> keeps the conversion in sync.
>
> B) The timekeeping core provides a function to convert an
> absolute CLOCK_MONOTONIC expiry time into an absolute time in
> clocksource cycles which can be programmed directly into the
> comparator without reading the clocksource at all.
>
> This is possible because timekeeping keeps a time pair of
> the base cycle count and the corresponding CLOCK_MONOTONIC base
> time at the last update of the timekeeper.
>
> So the absolute cycle time can be calculated by calculating
> the relative time to the CLOCK_MONOTONIC base time,
> converting the delta into cycles with the help of #A and
> adding the base cycle count. Pure math, no hardware access.
>
> C) The clockevent reprogramming code invokes this conversion
> function when the clockevent device indicates 'coupled'
> mode. The function returns false when the corresponding
> clocksource is not the current system clocksource (based on
> a clocksource ID check) and true if the clocksource matches
> and the conversion is successful.
>
> If false, the regular relative set_next_event() mechanism is
> used, otherwise a new set_next_coupled() callback which
> takes the calculated absolute expiry time as argument.
>
> Similar to the clocksource, this new callback can optionally
> be inlined.
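Step B's pure-math conversion can be sketched like so (the struct layout and the simple cycles-per-microsecond factor are my assumptions for readability; the kernel's timekeeper uses mult/shift arithmetic):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical snapshot of the timekeeper's cached base pair plus
 * the NTP-adjusted reverse (ns -> cycles) conversion from step A. */
struct tk_base {
    uint64_t base_cycles;   /* clocksource cycles at last tk update */
    uint64_t base_mono_ns;  /* CLOCK_MONOTONIC at the same update   */
    uint64_t cycles_per_us; /* NTP-adjusted reverse factor          */
};

/* Relative delta -> cycles -> absolute. Pure math, no hardware
 * access; returns false if the expiry precedes the base time. */
static bool expiry_to_cycles(const struct tk_base *tk, uint64_t expiry_ns,
                             uint64_t *cycles)
{
    if (expiry_ns < tk->base_mono_ns)
        return false;
    *cycles = tk->base_cycles +
              (expiry_ns - tk->base_mono_ns) * tk->cycles_per_us / 1000;
    return true;
}
```

With, say, a base of 1000 cycles at 5ms and 3 cycles/us, an expiry 1ms past the base lands at cycle 4000, directly programmable into the comparator.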
>
>
> 3) hrtimers
>
> It turned out that the hrtimer code needed a long overdue spring
> cleaning independent of the problem at hand. That was conducted
> before tackling the actual performance issues:
>
> - Timer locality
>
> The handling of timer locality is suboptimal and often results in
> pointless invocations of switch_hrtimer_base() which end up
> keeping the CPU base unchanged.
>
> Aside from the pointless overhead, this prevents further
> optimizations for the common local case.
>
> Address this by improving the decision logic for keeping the clock
> base local and splitting out the (re)arm handling into a unified
> operation.
>
>
> - Evaluation of the clock base expiries
>
> The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
> expiring timer, but not the corresponding expiry time, which means
> a re-evaluation of the clock bases for the next expiring timer on
> the CPU requires touching up to four extra cache lines.
>
> Trivial to solve by caching the earliest expiry time in the clock
> base itself.
>
>
> - Reprogramming of the clock event device
>
> The hrtimer interrupt already defers reprogramming until the
> interrupt handler completes, but in case of the hrtick timer
> that's not sufficient because the hrtick timer callback only sets
> the NEED_RESCHED flag but has no information about the next hrtick
> timer expiry time, which can only be determined in the scheduler.
>
> Expand the deferred reprogramming so it can ideally be handled in
> the subsequent schedule() after the new hrtick value has been
> established. If there is no schedule(), if soft interrupts have to
> be processed on return from interrupt, or if a nested interrupt
> hits before reaching schedule(), the deferred reprogramming is
> handled in those contexts.
>
>
> - Modification of queued timers
>
> If a timer is already queued, modifying the expiry time requires
> dequeueing it from the RB tree and requeueing it after the new
> expiry value has been updated. It turned out that hrtick timer
> modifications very often end up at the same spot in the RB tree as
> they were before, which means the dequeue/enqueue cycle along
> with the related rebalancing could have been avoided. The timer
> wheel timers have a similar mechanism: they check upfront whether
> the resulting expiry time keeps them in the same hash bucket.
>
> An attempt was made to check this with rb_prev() and rb_next() to
> evaluate whether the modification keeps the timer in the same
> spot, but that turned out to be really inefficient.
>
> Solve this by providing a RB tree variant which extends the node
> with links to the previous and next nodes, which is established
> when the node is linked into the tree or adjusted when it is
> removed. These links allow a quick peek into the previous and next
> expiry time and if the new expiry stays in the boundary the whole
> RB tree operation can be avoided.
>
> This also simplifies the caching and update of the leftmost node
> as on removal the rb_next() walk can be completely avoided. It
> could obviously provide a cached rightmost pointer too, but there
> is no use case for that (yet).
>
> On a hackbench run this results in about 35% of the updates being
> handled that way, which cuts the execution time of
> hrtimer_start_range_ns() down to 50ns on a 2GHz machine.
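The quick neighbour peek enabled by the extended nodes could look roughly like this (field and function names are illustrative assumptions; the real nodes live inside the timerqueue/rbtree structures):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node with the extra prev/next links that the extended
 * RB tree variant maintains on insert and remove. */
struct tq_node {
    uint64_t expires;
    struct tq_node *prev;
    struct tq_node *next;
};

/* Returns true if the new expiry keeps the node between its cached
 * neighbours, so the dequeue/enqueue and rebalancing are skipped. */
static bool timer_update_in_place(struct tq_node *node, uint64_t new_expires)
{
    uint64_t lo = node->prev ? node->prev->expires : 0;
    uint64_t hi = node->next ? node->next->expires : UINT64_MAX;

    if (new_expires < lo || new_expires > hi)
        return false;       /* ordering changes: full requeue needed */
    node->expires = new_expires;
    return true;
}
```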
>
>
> - Cancellation of queued timers
>
> Cancelling a timer or moving its expiry time past the programmed
> time can result in reprogramming the clock event device.
> Especially with frequent modifications of a queued timer this
> results in substantial overhead especially in VMs.
>
> Provide an option for hrtimers to tell the core to handle
> reprogramming lazily in those cases, trading frequent
> reprogramming against an occasional pointless hrtimer interrupt.
>
> For the hrtick timer this turned out to be a reasonable
> tradeoff. It's especially valuable when transitioning to idle,
> where the timer has to be cancelled but then the NOHZ idle code
> will reprogram it in case of a long idle sleep anyway. But also in
> high frequency scheduling scenarios this turned out to be
> beneficial.
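The lazy option amounts to something like this toy model (the names and bookkeeping are my assumptions, not the series' code):

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t hw_programmed;      /* deadline armed in the device     */
static uint64_t next_sw_expiry;     /* earliest queued software expiry  */
static unsigned int reprogram_count;

/* Cancelling (or pushing out) the first-expiring timer: in lazy mode
 * the stale hardware deadline is left armed, accepting a possibly
 * spurious interrupt instead of paying for a reprogram (a VM-Exit in
 * a guest). Eager mode reprograms immediately, as before. */
static void cancel_first_timer(uint64_t new_first, bool lazy)
{
    next_sw_expiry = new_first;
    if (lazy)
        return;                 /* stale deadline may fire spuriously */
    hw_programmed = new_first;
    reprogram_count++;
}
```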
>
>
> With all the above modifications in place, enabling hrtick no longer
> results in regressions compared to the hrtick-disabled mode.
>
> The reprogramming frequency of the clockevent device dropped from
> ~2500/sec to ~100/sec for a hackbench run with a spurious hrtimer interrupt
> ratio of about 25%.
>
> What's interesting is the astonishing improvement of a hackbench run with
> the following command line parameters: '-l$LOOPS -p -s8'. That uses pipes
> with a message size of 8 bytes. On a 112 CPU SKL machine this results in:
>
> NO HRTICK[_DL] HRTICK[_DL]
> runtime: 0.840s 0.481s ~-42%
>
> With other message sizes up to 256, HRTICK still results in improvements,
> but not in that magnitude. Haven't investigated the cause of that yet.
>
> While quite some parts of the series are independent enhancements, I've
> decided to keep them together in one big pile for now as all of the
> components are required to actually achieve the overall goal.
>
> The patches have been already structured in a way that they can be
> distributed to different subsystem branches without causing major cross
> subsystem contamination or merge conflict headaches.
>
> The series applies on v7.0-rc1 and is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick
>
> Thanks,
>
> tglx
> ---
> arch/x86/Kconfig | 2
> arch/x86/include/asm/clock_inlined.h | 22
> arch/x86/kernel/apic/apic.c | 41 -
> arch/x86/kernel/tsc.c | 4
> include/asm-generic/thread_info_tif.h | 5
> include/linux/clockchips.h | 8
> include/linux/clocksource.h | 3
> include/linux/hrtimer.h | 59 -
> include/linux/hrtimer_defs.h | 79 +-
> include/linux/hrtimer_rearm.h | 83 ++
> include/linux/hrtimer_types.h | 19
> include/linux/irq-entry-common.h | 25
> include/linux/rbtree.h | 81 ++
> include/linux/rbtree_types.h | 16
> include/linux/rseq_entry.h | 14
> include/linux/timekeeper_internal.h | 8
> include/linux/timerqueue.h | 56 +
> include/linux/timerqueue_types.h | 15
> include/trace/events/timer.h | 35 -
> kernel/entry/common.c | 4
> kernel/sched/core.c | 89 ++
> kernel/sched/deadline.c | 2
> kernel/sched/fair.c | 55 -
> kernel/sched/features.h | 5
> kernel/sched/sched.h | 41 -
> kernel/softirq.c | 15
> kernel/time/Kconfig | 16
> kernel/time/clockevents.c | 48 +
> kernel/time/hrtimer.c | 1116 +++++++++++++++++++---------------
> kernel/time/tick-broadcast-hrtimer.c | 1
> kernel/time/tick-sched.c | 27
> kernel/time/timekeeping.c | 184 +++++
> kernel/time/timekeeping.h | 2
> kernel/time/timer_list.c | 12
> lib/rbtree.c | 17
> lib/timerqueue.c | 14
> 36 files changed, 1497 insertions(+), 728 deletions(-)
>
>
>
FWIW I tested various workloads for this on an arm64 rk3399 comparing
mainline NO_HRTICK
mainline HRTICK
rearm NO_HRTICK
rearm HRTICK
rearm being $SUBJECT + arm64 generic entry + enabling generic TIF bits.
https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@xxxxxxxxxx/
There's nothing statistically significant with 1000HZ (it has 6 CPUs, so base
slice granularity is 2.1ms).
With 250HZ I get at least something, a selection:
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| Test | mainline NO_HRTICK | mainline HRTICK | rearm NO_HRTICK | rearm HRTICK | subject NO_HRTICK | subject HRTICK |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| schbench | 306.83 ± 3.10 | 301.81 ± 1.07 | 298.67 ± 3.33 | (304.87 ± 3.29) | (305.79 ± 3.64) | (307.07 ± 1.05) |
| ebizzy | 10664 ± 19 | (10565 ± 285) | (10510 ± 245) | (10580 ± 240) | (10674 ± 259) | 10816 ± 27 |
| hackbench | 19.715 ± 0.11 | (19.707 ± 0.10) | (19.826 ± 0.15) | (19.81 ± 0.12) | 19.98 ± 0.10 | (19.74 ± 0.11) |
| nullb0 IOPS | 102525 ± 367 | (101850 ± 262) | 92209 ± 7624 | (103385 ± 422) | (101854 ± 473) | (102141 ± 149) |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
(subject is $SUBJECT only, so no REARM_DEFERRED on arm64).
But at least there is no regression with sched_feat HRTICK.