Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update

From: Shrikanth Hegde

Date: Tue Oct 28 2025 - 11:22:56 EST

On 10/28/25 7:54 PM, Steve Wahl wrote:

On Tue, Oct 28, 2025 at 11:39:30AM +0530, Shrikanth Hegde wrote:

On 10/28/25 12:04 AM, Steve Wahl wrote:

On large NUMA systems, while running a test program that saturates the
inter-processor and inter-NUMA links, acquiring the jiffies_lock can
be very expensive. If the cpu designated to do jiffies updates
(tick_do_timer_cpu) gets delayed and other cpus decide to do the
jiffies update themselves, a large number of them decide to do so at
the same time. The inexpensive check against tick_next_period is far
quicker than actually acquiring the lock, so most of these get in line
to obtain the lock. If obtaining the lock is slow enough, this
spirals into the vast majority of CPUs continuously being stuck
waiting for this lock, just to obtain it and find out that time has
already been updated by another cpu. For example, on one random entry
to kdb by manually-injected NMI, I saw 2912 of 3840 cpus stuck here.

To avoid this, allow only one non-timekeeper CPU to call
tick_do_update_jiffies64() at any given time, resetting ts->stalled
jiffies only if the jiffies update function is actually called.

With this change, manually interrupting the test I find at most two
CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
other).

Signed-off-by: Steve Wahl <steve.wahl@xxxxxxx>
---

v2: Rewritten to use an atomic to gate non-timekeeping cpus calling the
jiffies update, as suggested by tglx. Title of patch has changed
since trylock is no longer used.

v1 discussion:
https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@xxxxxxx/

kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..3ff3eb1f90d0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
ts->flags &= ~flag;
}
+/*
+ * Allow only one non-timekeeper CPU at a time update jiffies from
+ * the timer tick.
+ *
+ * Returns true if update was run.
+ */
+static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
+{
+ static atomic_t in_progress;
+ int inp;
+
+ inp = atomic_read(&in_progress);
+ if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
+ return false;
+

You come here if (ts->last_tick_jiffies == jiffies). So it may be not necessary to check again.

TGLX had this in his rewrite suggestion, and I looked pretty intensely
at this test.

The situation I'm looking to resolve is caused by inter-NUMA links
being abnormally swamped with traffic. Especially for writes, access
to shared memory locations, such as the atomic operations to
in_progress right above this, take longer than one usually would
expect. So to me it makes sense that things may have changed since
the atomic_try_cmpxchg was initiated, and so I left the check in
place.

I see, one possibility is

- if it runs in parallel by that time on tick_cpu.( which always updates it)

+ if (ts->last_tick_jiffies == jiffies)
+ tick_do_update_jiffies64(now);
+ atomic_set(&in_progress, 0);
+ return true;
+}
+
#define MAX_STALLED_JIFFIES 5
static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
@@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
ts->stalled_jiffies = 0;
ts->last_tick_jiffies = READ_ONCE(jiffies);
} else {
- if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
- tick_do_update_jiffies64(now);
- ts->stalled_jiffies = 0;
- ts->last_tick_jiffies = READ_ONCE(jiffies);
+ if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
+ if (tick_limited_update_jiffies64(ts, now)) {
+ ts->stalled_jiffies = 0;
+ ts->last_tick_jiffies = READ_ONCE(jiffies);
+ }
}
}

Yes. This could help large systems.

Acked-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>

Thanks for your time reviewing!

--> Steve Wahl