[GIT pull] timers/core for v7.1-rc1
From: Thomas Gleixner
Date: Sun Apr 12 2026 - 13:50:44 EST
Linus,
please pull the latest timers/core branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-core-2026-04-12
up to: ff1c0c5d0702: Merge branch 'timers/urgent' into timers/core
Updates for the timer/timekeeping core:
- A rework of the hrtimer subsystem to reduce the overhead for frequently
armed timers, especially the hrtick scheduler timer.
- Better timer locality decisions
- Simplification of the first expiry time evaluation by providing a
RB-tree variant with neighbor links, which keeps track of each
timer's neighbors. That avoids walking the RB-tree on removal to
find the next expiry time, but more importantly allows a quick check
whether a rearmed timer's modified expiry time changes its position
in the RB-tree. If not, the dequeue/enqueue sequence, where both
operations can end up rebalancing the tree, can be avoided
completely.
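The neighbor-link check can be illustrated with a small userspace sketch.
The struct and function names here are invented for illustration and are
not the actual rbtree/timerqueue layout from this series:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue node with in-order neighbor links */
struct tq_node {
	uint64_t expires;
	struct tq_node *prev;	/* in-order predecessor, or NULL */
	struct tq_node *next;	/* in-order successor, or NULL */
};

/*
 * A rearmed timer keeps its tree position when the new expiry still
 * sorts between its neighbors. In that case the dequeue/enqueue
 * sequence (and possible rebalancing) can be skipped and the expiry
 * updated in place.
 */
static bool tq_can_modify_in_place(const struct tq_node *n, uint64_t new_expires)
{
	if (n->prev && new_expires < n->prev->expires)
		return false;
	if (n->next && new_expires > n->next->expires)
		return false;
	return true;
}
```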
- Deferred reprogramming of the underlying clock event device. This
optimizes for the situation where a hrtimer callback sets the need
resched bit. In that case the code attempts to defer the
reprogramming of the clock event device up to the point where the
scheduler has picked the next task and has armed the next hrtick
timer. If there is no immediate reschedule, or soft interrupts have
to be handled before the reschedule point in the interrupt entry
code is reached, the clock event device is reprogrammed in one of
those code paths to prevent the timer from becoming stale.
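The decision logic reduces to something like the following sketch. This is
not the kernel implementation; the struct, flag, and function names are
made up to illustrate the deferral and the interrupt-return fallback:

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; the pending flag stands in for a
 * TIF_HRTIMER_REARM-style thread flag. */
struct cpu_state {
	bool need_resched;
	bool hrtimer_rearm_pending;
};

/*
 * After a hrtimer interrupt: reprogramming can only be deferred when
 * the handler set need-resched, because then the scheduler will arm
 * the next hrtick timer (and thereby reprogram the device) when it
 * picks the next task.
 */
static bool defer_reprogram(struct cpu_state *s)
{
	if (!s->need_resched)
		return false;			/* reprogram the device now */
	s->hrtimer_rearm_pending = true;	/* handled later on one of the
						 * fallback paths */
	return true;
}

/*
 * Fallback on the interrupt-return (or softirq/scheduler) path: if a
 * deferral is still pending, the device must be reprogrammed here so
 * the timer cannot become stale.
 */
static bool exit_path_needs_reprogram(struct cpu_state *s)
{
	bool pending = s->hrtimer_rearm_pending;

	s->hrtimer_rearm_pending = false;
	return pending;
}
```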
- Support for clocksource coupled clockevents
The TSC deadline timer is coupled to the TSC. The next event is
programmed in TSC time. Currently this is done by converting the
CLOCK_MONOTONIC based expiry value into a relative timeout,
converting that into TSC ticks, reading the TSC, adding the delta
ticks and writing the deadline MSR.
As the timekeeping core has the conversion factors for the TSC
already, the whole back and forth conversion can be completely
avoided. The timekeeping core calculates the reverse conversion
factors from nanoseconds to TSC ticks and utilizes the base
timestamps of TSC and CLOCK_MONOTONIC which are updated once per
tick. This allows a direct conversion into the TSC deadline value
without reading the time and as a bonus keeps the deadline
conversion in sync with the TSC conversion factors, which are
updated by adjtimex() on systems with NTP/PTP enabled.
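The direct conversion boils down to scaling the delta from the per-tick
base timestamps with the inverse mult/shift factors. A rough userspace
sketch, with invented field names rather than the actual timekeeper
layout:

```c
#include <stdint.h>

/* Hypothetical per-tick snapshot maintained by the timekeeping core */
struct coupled_base {
	uint64_t mono_ns;	/* CLOCK_MONOTONIC at the last tick */
	uint64_t tsc;		/* TSC readout at the same instant */
	uint32_t mult;		/* ns -> cycles: (ns * mult) >> shift */
	uint32_t shift;
};

/*
 * Convert an absolute CLOCK_MONOTONIC expiry directly into a TSC
 * deadline value without reading the clock: take the delta to the
 * base timestamp and scale it with the reverse conversion factors.
 */
static uint64_t ns_to_tsc_deadline(const struct coupled_base *b, uint64_t expiry_ns)
{
	uint64_t delta_ns = expiry_ns > b->mono_ns ? expiry_ns - b->mono_ns : 0;

	return b->tsc + ((delta_ns * b->mult) >> b->shift);
}
```

Because the base timestamps and factors are refreshed once per tick (and
by adjtimex()), the deadline stays in sync with the TSC conversion.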
- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
With all those enhancements in place, a hrtick-enabled scheduler
provides the same performance as one without hrtick. Other hrtimer
users obviously benefit from these optimizations as well.
- Robustness improvements and cleanups of historical sins in the hrtimer
and timekeeping code.
- Rewrite of the clocksource watchdog.
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, which was
made in the context of systems far smaller than today, is based on the
assumption that the clocksource to be monitored (TSC) can be trivially
compared against a clocksource known to be stable (HPET/ACPI-PM timer).
Over the years this rather naive approach turned out to have major
flaws. Long delays between watchdog invocations can cause wraparounds
of the reference clocksource. The access to the reference clocksource
degrades on large multi-socket systems due to interconnect
congestion. This has been addressed with various heuristics, which
degraded the accuracy of the watchdog to the point that it fails to
detect actual TSC problems on older hardware which exposes slow
inter-CPU drift due to firmware manipulating the TSC to hide SMI
time.
The rewrite addresses this by:
- Restricting the validation against the reference clocksource to the
boot CPU, which is usually closest to the legacy block which
contains the reference clocksource (HPET/ACPI-PM).
- Doing a round-robin validation between the boot CPU and the other
CPUs based only on the TSC, with an algorithm similar to the TSC
synchronization code during CPU hotplug.
- Being more lenient about remote timeouts.
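The round-robin pairing itself is simple; the sketch below shows only the
illustrative partner selection, not the TSC-only handshake (which
resembles the hotplug TSC synchronization check). Names are made up:

```c
/*
 * Pick the partner CPU for one validation round. CPU 0 is the boot
 * CPU; the remaining CPUs 1..nr_cpus-1 are cycled through in turn,
 * so every CPU is validated against the boot CPU periodically.
 */
static unsigned int wd_round_robin_partner(unsigned int round, unsigned int nr_cpus)
{
	if (nr_cpus < 2)
		return 0;	/* nothing to cross-check */
	return 1 + (round % (nr_cpus - 1));
}
```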
- The usual tiny fixes, cleanups and enhancements all over the place
Thanks,
tglx
------------------>
Ingo Molnar (1):
sched/hrtick: Mark hrtick_clear() as always used
Josh Snyder (1):
tick/nohz: Fix inverted return value in check_tick_dependency() fast path
Peter Zijlstra (12):
sched/eevdf: Fix HRTICK duration
hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
hrtimer: Provide LAZY_REARM mode
sched/hrtick: Mark hrtick timer LAZY_REARM
hrtimer: Re-arrange hrtimer_interrupt()
hrtimer: Prepare stubs for deferred rearming
entry: Prepare for deferred hrtimer rearming
softirq: Prepare for deferred hrtimer rearming
sched/core: Prepare for deferred hrtimer rearming
hrtimer: Push reprogramming timers into the interrupt return path
sched: Default enable HRTICK when deferred rearming is enabled
hrtimer: Less agressive interrupt 'hang' handling
Peter Zijlstra (Intel) (2):
sched/fair: Simplify hrtick_update()
sched/fair: Make hrtick resched hard
Petr Pavlu (1):
jiffies: Remove unused __jiffy_arch_data
Ryota Sakamoto (1):
time/kunit: Add .kunitconfig
Shrikanth Hegde (1):
timers: Get this_cpu once while clearing the idle state
Thomas Gleixner (43):
sched: Avoid ktime_get() indirection
hrtimer: Provide a static branch based hrtimer_hres_enabled()
sched: Use hrtimer_highres_enabled()
sched: Optimize hrtimer handling
sched/hrtick: Avoid tiny hrtick rearms
tick/sched: Avoid hrtimer_cancel/start() sequence
clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
timekeeping: Allow inlining clocksource::read()
x86: Inline TSC reads in timekeeping
x86/apic: Remove pointless fence in lapic_next_deadline()
x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
timekeeping: Provide infrastructure for coupled clockevents
clockevents: Provide support for clocksource coupled comparators
x86/apic: Enable TSC coupled programming mode
hrtimer: Add debug object init assertion
hrtimer: Reduce trace noise in hrtimer_start()
hrtimer: Use guards where appropriate
hrtimer: Cleanup coding style and comments
hrtimer: Evaluate timer expiry only once
hrtimer: Replace the bitfield in hrtimer_cpu_base
hrtimer: Convert state and properties to boolean
hrtimer: Optimize for local timers
hrtimer: Use NOHZ information for locality
hrtimer: Separate remove/enqueue handling for local timers
hrtimer: Add hrtimer_rearm tracepoint
hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
hrtimer: Avoid re-evaluation when nothing changed
hrtimer: Keep track of first expiring timer per clock base
hrtimer: Rework next event evaluation
hrtimer: Simplify run_hrtimer_queues()
hrtimer: Optimize for_each_active_base()
rbtree: Provide rbtree with links
timerqueue: Provide linked timerqueue
hrtimer: Use linked timerqueue
hrtimer: Try to modify timers in place
timekeeping: Initialize the coupled clocksource conversion completely
clocksource: Update clocksource::freq_khz on registration
parisc: Remove unused clocksource flags
MIPS: Don't select CLOCKSOURCE_WATCHDOG
x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
clocksource: Don't use non-continuous clocksources as watchdog
clocksource: Rewrite watchdog code completely
clockevents: Prevent timer interrupt starvation
Thomas Weißschuh (Schneider Electric) (12):
scripts/gdb: timerlist: Adapt to move of tk_core
tracing: Use explicit array size instead of sentinel elements in symbol printing
timer_list: Print offset as signed integer
timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
timekeeping: Mark offsets array as const
hrtimer: Remove hrtimer_get_expires_ns()
hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
hrtimer: Drop spurious space in 'enum hrtimer_base_type'
hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
hrtimer: Mark index and clockid of clock base as const
hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
Zhan Xusheng (3):
posix-timers: Fix stale function name in comment
hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
alarmtimer: Access timerqueue node under lock in suspend
Documentation/admin-guide/kernel-parameters.txt | 7 +-
MAINTAINERS | 1 +
arch/mips/Kconfig | 1 -
arch/parisc/kernel/time.c | 5 +-
arch/x86/Kconfig | 2 +
arch/x86/include/asm/clock_inlined.h | 22 +
arch/x86/include/asm/time.h | 1 -
arch/x86/kernel/apic/apic.c | 41 +-
arch/x86/kernel/hpet.c | 4 +-
arch/x86/kernel/tsc.c | 61 +-
drivers/clocksource/Kconfig | 1 -
drivers/clocksource/acpi_pm.c | 4 +-
include/asm-generic/thread_info_tif.h | 5 +-
include/linux/clockchips.h | 12 +-
include/linux/clocksource.h | 27 +-
include/linux/hrtimer.h | 64 +-
include/linux/hrtimer_defs.h | 83 +-
include/linux/hrtimer_rearm.h | 83 ++
include/linux/hrtimer_types.h | 19 +-
include/linux/irq-entry-common.h | 25 +-
include/linux/jiffies.h | 6 +-
include/linux/rbtree.h | 81 +-
include/linux/rbtree_types.h | 16 +
include/linux/rseq_entry.h | 16 +-
include/linux/timekeeper_internal.h | 8 +
include/linux/timerqueue.h | 56 +-
include/linux/timerqueue_types.h | 15 +-
include/linux/trace_events.h | 13 +-
include/trace/events/timer.h | 42 +-
include/trace/stages/stage3_trace_output.h | 40 +-
kernel/entry/common.c | 4 +-
kernel/sched/core.c | 91 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 55 +-
kernel/sched/features.h | 5 +
kernel/sched/sched.h | 41 +-
kernel/softirq.c | 15 +-
kernel/time/.kunitconfig | 2 +
kernel/time/Kconfig | 28 +-
kernel/time/alarmtimer.c | 12 +-
kernel/time/clockevents.c | 71 +-
kernel/time/clocksource-wdtest.c | 268 +++---
kernel/time/clocksource.c | 805 ++++++++--------
kernel/time/hrtimer.c | 1128 +++++++++++++----------
kernel/time/jiffies.c | 1 -
kernel/time/posix-timers.c | 2 +-
kernel/time/tick-broadcast-hrtimer.c | 1 -
kernel/time/tick-broadcast.c | 8 +-
kernel/time/tick-common.c | 1 +
kernel/time/tick-sched.c | 30 +-
kernel/time/timekeeping.c | 203 +++-
kernel/time/timekeeping.h | 2 +
kernel/time/timer.c | 5 +-
kernel/time/timer_list.c | 16 +-
kernel/trace/trace_events_synth.c | 4 +-
kernel/trace/trace_output.c | 20 +-
kernel/trace/trace_syscalls.c | 3 +-
lib/rbtree.c | 17 +
lib/timerqueue.c | 14 +
scripts/gdb/linux/timerlist.py | 2 +-
60 files changed, 2222 insertions(+), 1395 deletions(-)
create mode 100644 arch/x86/include/asm/clock_inlined.h
create mode 100644 include/linux/hrtimer_rearm.h
create mode 100644 kernel/time/.kunitconfig
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a550630644..bd4e6c0b2f0a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7963,12 +7963,7 @@ Kernel parameters
(HPET or PM timer) on systems whose TSC frequency was
obtained from HW or FW using either an MSR or CPUID(0x15).
Warn if the difference is more than 500 ppm.
- [x86] watchdog: Use TSC as the watchdog clocksource with
- which to check other HW timers (HPET or PM timer), but
- only on systems where TSC has been deemed trustworthy.
- This will be suppressed by an earlier tsc=nowatchdog and
- can be overridden by a later tsc=nowatchdog. A console
- message will flag any such suppression or overriding.
+ [x86] watchdog: Enforce the clocksource watchdog on TSC
tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
diff --git a/MAINTAINERS b/MAINTAINERS
index c3fe46d7c4bc..292e9ce3b65e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26621,6 +26621,7 @@ F: include/linux/timekeeping.h
F: include/linux/timex.h
F: include/uapi/linux/time.h
F: include/uapi/linux/timex.h
+F: kernel/time/.kunitconfig
F: kernel/time/alarmtimer.c
F: kernel/time/clocksource*
F: kernel/time/ntp*
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index e48b62b4dc48..4364f3dba688 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -1131,7 +1131,6 @@ config CSRC_IOASIC
bool
config CSRC_R4K
- select CLOCKSOURCE_WATCHDOG if CPU_FREQ
bool
config CSRC_SB1250
diff --git a/arch/parisc/kernel/time.c b/arch/parisc/kernel/time.c
index 94dc48455dc6..71c9d5426995 100644
--- a/arch/parisc/kernel/time.c
+++ b/arch/parisc/kernel/time.c
@@ -210,12 +210,9 @@ static struct clocksource clocksource_cr16 = {
.read = read_cr16,
.mask = CLOCKSOURCE_MASK(BITS_PER_LONG),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY |
- CLOCK_SOURCE_VERIFY_PERCPU,
+ CLOCK_SOURCE_VALID_FOR_HRES,
};
-
/*
* timer interrupt and sched_clock() initialization
*/
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..560d2ce8cedd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -141,6 +141,7 @@ config X86
select ARCH_USE_SYM_ANNOTATIONS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_DEFAULT_BPF_JIT if X86_64
+ select ARCH_WANTS_CLOCKSOURCE_READ_INLINE if X86_64
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_GENERAL_HUGETLB
@@ -163,6 +164,7 @@ config X86
select EDAC_SUPPORT
select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
select GENERIC_CLOCKEVENTS_BROADCAST_IDLE if GENERIC_CLOCKEVENTS_BROADCAST
+ select GENERIC_CLOCKEVENTS_COUPLED_INLINE if X86_64
select GENERIC_CLOCKEVENTS_MIN_ADJUST
select GENERIC_CMOS_UPDATE
select GENERIC_CPU_AUTOPROBE
diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
new file mode 100644
index 000000000000..b2dee8db2fb9
--- /dev/null
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CLOCK_INLINED_H
+#define _ASM_X86_CLOCK_INLINED_H
+
+#include <asm/tsc.h>
+
+struct clocksource;
+
+static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
+{
+ return (u64)rdtsc_ordered();
+}
+
+struct clock_event_device;
+
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
+{
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+}
+
+#endif
diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index f360104ed172..459780c3ed1f 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -7,7 +7,6 @@
extern void hpet_time_init(void);
extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);
extern struct clock_event_device *global_clock_event;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 961714e6adae..0c8970c4c3e3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -412,23 +412,21 @@ EXPORT_SYMBOL_GPL(setup_APIC_eilvt);
/*
* Program the next event, relative to now
*/
-static int lapic_next_event(unsigned long delta,
- struct clock_event_device *evt)
+static int lapic_next_event(unsigned long delta, struct clock_event_device *evt)
{
apic_write(APIC_TMICT, delta);
return 0;
}
-static int lapic_next_deadline(unsigned long delta,
- struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt)
{
- u64 tsc;
-
- /* This MSR is special and need a special fence: */
- weak_wrmsr_fence();
+ /*
+ * There is no weak_wrmsr_fence() required here as all of this is purely
+ * CPU local. Avoid the [ml]fence overhead.
+ */
+ u64 tsc = rdtsc();
- tsc = rdtsc();
- wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
return 0;
}
@@ -452,7 +450,7 @@ static int lapic_timer_shutdown(struct clock_event_device *evt)
* the timer _and_ zero the counter registers:
*/
if (v & APIC_LVT_TIMER_TSCDEADLINE)
- wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
else
apic_write(APIC_TMICT, 0);
@@ -549,6 +547,11 @@ static __init bool apic_validate_deadline_timer(void)
if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
return false;
+
+ /* XEN_PV does not support it, but be paranoia about it */
+ if (boot_cpu_has(X86_FEATURE_XENPV))
+ goto clear;
+
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return true;
@@ -561,9 +564,11 @@ static __init bool apic_validate_deadline_timer(void)
if (boot_cpu_data.microcode >= rev)
return true;
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
"please update microcode to version: 0x%x (or later)\n", rev);
+
+clear:
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return false;
}
@@ -586,14 +591,14 @@ static void setup_APIC_timer(void)
if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
levt->name = "lapic-deadline";
- levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
- CLOCK_EVT_FEAT_DUMMY);
+ levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
+ levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+ levt->cs_id = CSID_X86_TSC;
levt->set_next_event = lapic_next_deadline;
- clockevents_config_and_register(levt,
- tsc_khz * (1000 / TSC_DIVISOR),
- 0xF, ~0UL);
- } else
+ clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
+ } else {
clockevents_register_device(levt);
+ }
apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
}
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 610590e83445..8dc7b710e125 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -854,7 +854,7 @@ static struct clocksource clocksource_hpet = {
.rating = 250,
.read = read_hpet,
.mask = HPET_MASK,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
.resume = hpet_resume_counter,
};
@@ -1082,8 +1082,6 @@ int __init hpet_enable(void)
if (!hpet_counting())
goto out_nohpet;
- if (tsc_clocksource_watchdog_disabled())
- clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
if (id & HPET_ID_LEGSUP) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d9aa694e43f3..c5110eb554bc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -322,12 +322,16 @@ int __init notsc_setup(char *str)
return 1;
}
#endif
-
__setup("notsc", notsc_setup);
+enum {
+ TSC_WATCHDOG_AUTO,
+ TSC_WATCHDOG_OFF,
+ TSC_WATCHDOG_ON,
+};
+
static int no_sched_irq_time;
-static int no_tsc_watchdog;
-static int tsc_as_watchdog;
+static int tsc_watchdog;
static int __init tsc_setup(char *str)
{
@@ -337,25 +341,14 @@ static int __init tsc_setup(char *str)
no_sched_irq_time = 1;
if (!strcmp(str, "unstable"))
mark_tsc_unstable("boot parameter");
- if (!strcmp(str, "nowatchdog")) {
- no_tsc_watchdog = 1;
- if (tsc_as_watchdog)
- pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
- __func__);
- tsc_as_watchdog = 0;
- }
+ if (!strcmp(str, "nowatchdog"))
+ tsc_watchdog = TSC_WATCHDOG_OFF;
if (!strcmp(str, "recalibrate"))
tsc_force_recalibrate = 1;
- if (!strcmp(str, "watchdog")) {
- if (no_tsc_watchdog)
- pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
- __func__);
- else
- tsc_as_watchdog = 1;
- }
+ if (!strcmp(str, "watchdog"))
+ tsc_watchdog = TSC_WATCHDOG_ON;
return 1;
}
-
__setup("tsc=", tsc_setup);
#define MAX_RETRIES 5
@@ -1175,7 +1168,6 @@ static int tsc_cs_enable(struct clocksource *cs)
static struct clocksource clocksource_tsc_early = {
.name = "tsc-early",
.rating = 299,
- .uncertainty_margin = 32 * NSEC_PER_MSEC,
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
@@ -1200,9 +1192,9 @@ static struct clocksource clocksource_tsc = {
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_VALID_FOR_HRES |
+ CLOCK_SOURCE_CAN_INLINE_READ |
CLOCK_SOURCE_MUST_VERIFY |
- CLOCK_SOURCE_VERIFY_PERCPU,
+ CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
.id = CSID_X86_TSC,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
@@ -1230,16 +1222,12 @@ EXPORT_SYMBOL_GPL(mark_tsc_unstable);
static void __init tsc_disable_clocksource_watchdog(void)
{
+ if (tsc_watchdog == TSC_WATCHDOG_ON)
+ return;
clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}
-bool tsc_clocksource_watchdog_disabled(void)
-{
- return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
- tsc_as_watchdog && !no_tsc_watchdog;
-}
-
static void __init check_system_tsc_reliable(void)
{
#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
@@ -1394,6 +1382,8 @@ static void tsc_refine_calibration_work(struct work_struct *work)
(unsigned long)tsc_khz / 1000,
(unsigned long)tsc_khz % 1000);
+ clocksource_tsc.flags |= CLOCK_SOURCE_CALIBRATED;
+
/* Inform the TSC deadline clockevent devices about the recalibration */
lapic_update_tsc_freq();
@@ -1409,6 +1399,15 @@ static void tsc_refine_calibration_work(struct work_struct *work)
have_art = true;
clocksource_tsc.base = &art_base_clk;
}
+
+ /*
+ * Transfer the valid for high resolution flag if it was set on the
+ * early TSC already. That guarantees that there is no intermediate
+ * clocksource selected once the early TSC is unregistered.
+ */
+ if (clocksource_tsc_early.flags & CLOCK_SOURCE_VALID_FOR_HRES)
+ clocksource_tsc.flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
clocksource_register_khz(&clocksource_tsc, tsc_khz);
unreg:
clocksource_unregister(&clocksource_tsc_early);
@@ -1460,12 +1459,10 @@ static bool __init determine_cpu_tsc_frequencies(bool early)
if (early) {
cpu_khz = x86_platform.calibrate_cpu();
- if (tsc_early_khz) {
+ if (tsc_early_khz)
tsc_khz = tsc_early_khz;
- } else {
+ else
tsc_khz = x86_platform.calibrate_tsc();
- clocksource_tsc.freq_khz = tsc_khz;
- }
} else {
/* We should not be here with non-native cpu calibration */
WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
@@ -1569,7 +1566,7 @@ void __init tsc_init(void)
return;
}
- if (tsc_clocksource_reliable || no_tsc_watchdog)
+ if (tsc_clocksource_reliable || tsc_watchdog == TSC_WATCHDOG_OFF)
tsc_disable_clocksource_watchdog();
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index fd9112706545..d1a33a231a44 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -596,7 +596,6 @@ config CLKSRC_VERSATILE
config CLKSRC_MIPS_GIC
bool
depends on MIPS_GIC
- select CLOCKSOURCE_WATCHDOG
select TIMER_OF
config CLKSRC_PXA
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index b4330a01a566..67792937242f 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -98,7 +98,7 @@ static struct clocksource clocksource_acpi_pm = {
.rating = 200,
.read = acpi_pm_read,
.mask = (u64)ACPI_PM_MASK,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
.suspend = acpi_pm_suspend,
.resume = acpi_pm_resume,
};
@@ -243,8 +243,6 @@ static int __init init_acpi_pm_clocksource(void)
return -ENODEV;
}
- if (tsc_clocksource_watchdog_disabled())
- clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
}
diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index da1610a78f92..528e6fc7efe9 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,11 +41,14 @@
#define _TIF_PATCH_PENDING BIT(TIF_PATCH_PENDING)
#ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal()
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
#define TIF_RSEQ 11 // Run RSEQ fast path
#define _TIF_RSEQ BIT(TIF_RSEQ)
+#define TIF_HRTIMER_REARM 12 // re-arm the timer
+#define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index b0df28ddd394..6adb72761246 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -43,9 +43,9 @@ enum clock_event_state {
/*
* Clock event features
*/
-# define CLOCK_EVT_FEAT_PERIODIC 0x000001
-# define CLOCK_EVT_FEAT_ONESHOT 0x000002
-# define CLOCK_EVT_FEAT_KTIME 0x000004
+# define CLOCK_EVT_FEAT_PERIODIC 0x000001
+# define CLOCK_EVT_FEAT_ONESHOT 0x000002
+# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED 0x000004
/*
* x86(64) specific (mis)features:
@@ -73,6 +73,7 @@ enum clock_event_state {
* level handler of the event source
* @set_next_event: set next event function using a clocksource delta
* @set_next_ktime: set next event function using a direct ktime value
+ * @set_next_coupled: set next event function for clocksource coupled mode
* @next_event: local storage for the next event in oneshot mode
* @max_delta_ns: maximum delta value in ns
* @min_delta_ns: minimum delta value in ns
@@ -80,6 +81,8 @@ enum clock_event_state {
* @shift: nanoseconds to cycles divisor (power of two)
* @state_use_accessors:current state of the device, assigned by the core code
* @features: features
+ * @cs_id: Clocksource ID to denote the clocksource for coupled mode
+ * @next_event_forced: True if the last programming was a forced event
* @retries: number of forced programming retries
* @set_state_periodic: switch state to periodic
* @set_state_oneshot: switch state to oneshot
@@ -101,6 +104,7 @@ struct clock_event_device {
void (*event_handler)(struct clock_event_device *);
int (*set_next_event)(unsigned long evt, struct clock_event_device *);
int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+ void (*set_next_coupled)(u64 cycles, struct clock_event_device *);
ktime_t next_event;
u64 max_delta_ns;
u64 min_delta_ns;
@@ -108,6 +112,8 @@ struct clock_event_device {
u32 shift;
enum clock_event_state state_use_accessors;
unsigned int features;
+ enum clocksource_ids cs_id;
+ unsigned int next_event_forced;
unsigned long retries;
int (*set_state_periodic)(struct clock_event_device *);
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 65b7c41471c3..ccf5c0ca26b7 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -44,8 +44,6 @@ struct module;
* @shift: Cycle to nanosecond divisor (power of two)
* @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
* @maxadj: Maximum adjustment value to mult (~11%)
- * @uncertainty_margin: Maximum uncertainty in nanoseconds per half second.
- * Zero says to use default WATCHDOG_THRESHOLD.
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
@@ -105,7 +103,6 @@ struct clocksource {
u32 shift;
u64 max_idle_ns;
u32 maxadj;
- u32 uncertainty_margin;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
@@ -133,6 +130,7 @@ struct clocksource {
struct list_head wd_list;
u64 cs_last;
u64 wd_last;
+ unsigned int wd_cpu;
#endif
struct module *owner;
};
@@ -142,13 +140,19 @@ struct clocksource {
*/
#define CLOCK_SOURCE_IS_CONTINUOUS 0x01
#define CLOCK_SOURCE_MUST_VERIFY 0x02
+#define CLOCK_SOURCE_CALIBRATED 0x04
#define CLOCK_SOURCE_WATCHDOG 0x10
#define CLOCK_SOURCE_VALID_FOR_HRES 0x20
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
+#define CLOCK_SOURCE_CAN_INLINE_READ 0x200
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT 0x400
+
+#define CLOCK_SOURCE_WDTEST 0x800
+#define CLOCK_SOURCE_WDTEST_PERCPU 0x1000
+
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
@@ -298,21 +302,6 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
-static inline unsigned int clocksource_get_max_watchdog_retry(void)
-{
- /*
- * When system is in the boot phase or under heavy workload, there
- * can be random big latencies during the clocksource/watchdog
- * read, so allow retries to filter the noise latency. As the
- * latency's frequency and maximum value goes up with the number of
- * CPUs, scale the number of retries with the number of online
- * CPUs.
- */
- return (ilog2(num_online_cpus()) / 2) + 1;
-}
-
-void clocksource_verify_percpu(struct clocksource *cs);
-
/**
* struct clocksource_base - hardware abstraction for clock on which a clocksource
* is based
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 74adbd4e7003..9ced498fefaa 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -13,6 +13,7 @@
#define _LINUX_HRTIMER_H
#include <linux/hrtimer_defs.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/hrtimer_types.h>
#include <linux/init.h>
#include <linux/list.h>
@@ -31,6 +32,13 @@
* soft irq context
* HRTIMER_MODE_HARD - Timer callback function will be executed in
* hard irq context even on PREEMPT_RT.
+ * HRTIMER_MODE_LAZY_REARM - Avoid reprogramming if the timer was the
+ * first expiring timer and is moved into the
+ * future. Special mode for the HRTICK timer to
+ * avoid extensive reprogramming of the hardware,
+ * which is expensive in virtual machines. Risks
+ * a pointless expiry, but that's better than
+ * reprogramming on every context switch,
*/
enum hrtimer_mode {
HRTIMER_MODE_ABS = 0x00,
@@ -38,6 +46,7 @@ enum hrtimer_mode {
HRTIMER_MODE_PINNED = 0x02,
HRTIMER_MODE_SOFT = 0x04,
HRTIMER_MODE_HARD = 0x08,
+ HRTIMER_MODE_LAZY_REARM = 0x10,
HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
@@ -55,33 +64,6 @@ enum hrtimer_mode {
HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
};
-/*
- * Values to track state of the timer
- *
- * Possible states:
- *
- * 0x00 inactive
- * 0x01 enqueued into rbtree
- *
- * The callback state is not part of the timer->state because clearing it would
- * mean touching the timer after the callback, this makes it impossible to free
- * the timer from the callback function.
- *
- * Therefore we track the callback state in:
- *
- * timer->base->cpu_base->running == timer
- *
- * On SMP it is possible to have a "callback function running and enqueued"
- * status. It happens for example when a posix timer expired and the callback
- * queued a signal. Between dropping the lock which protects the posix timer
- * and reacquiring the base lock of the hrtimer, another CPU can deliver the
- * signal and rearm the timer.
- *
- * All state transitions are protected by cpu_base->lock.
- */
-#define HRTIMER_STATE_INACTIVE 0x00
-#define HRTIMER_STATE_ENQUEUED 0x01
-
/**
* struct hrtimer_sleeper - simple sleeper structure
* @timer: embedded timer structure
@@ -134,11 +116,6 @@ static inline ktime_t hrtimer_get_softexpires(const struct hrtimer *timer)
return timer->_softexpires;
}
-static inline s64 hrtimer_get_expires_ns(const struct hrtimer *timer)
-{
- return ktime_to_ns(timer->node.expires);
-}
-
ktime_t hrtimer_cb_get_time(const struct hrtimer *timer);
static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
@@ -146,24 +123,23 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
return ktime_sub(timer->node.expires, hrtimer_cb_get_time(timer));
}
-static inline int hrtimer_is_hres_active(struct hrtimer *timer)
-{
- return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
- timer->base->cpu_base->hres_active : 0;
-}
-
#ifdef CONFIG_HIGH_RES_TIMERS
+extern unsigned int hrtimer_resolution;
struct clock_event_device;
extern void hrtimer_interrupt(struct clock_event_device *dev);
-extern unsigned int hrtimer_resolution;
+extern struct static_key_false hrtimer_highres_enabled_key;
-#else
+static inline bool hrtimer_highres_enabled(void)
+{
+ return static_branch_likely(&hrtimer_highres_enabled_key);
+}
+#else /* CONFIG_HIGH_RES_TIMERS */
#define hrtimer_resolution (unsigned int)LOW_RES_NSEC
-
-#endif
+static inline bool hrtimer_highres_enabled(void) { return false; }
+#endif /* !CONFIG_HIGH_RES_TIMERS */
static inline ktime_t
__hrtimer_expires_remaining_adjusted(const struct hrtimer *timer, ktime_t now)
@@ -293,8 +269,8 @@ extern bool hrtimer_active(const struct hrtimer *timer);
*/
static inline bool hrtimer_is_queued(struct hrtimer *timer)
{
- /* The READ_ONCE pairs with the update functions of timer->state */
- return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
+ /* The READ_ONCE pairs with the update functions of timer->is_queued */
+ return READ_ONCE(timer->is_queued);
}
/*
diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index 02b010df6570..52ed9e46ff13 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -19,21 +19,23 @@
* timer to a base on another cpu.
* @clockid: clock id for per_cpu support
* @seq: seqcount around __run_hrtimer
+ * @expires_next: Absolute time of the next event in this clock base
* @running: pointer to the currently running hrtimer
* @active: red black tree root node for the active timers
* @offset: offset of this clock to the monotonic base
*/
struct hrtimer_clock_base {
- struct hrtimer_cpu_base *cpu_base;
- unsigned int index;
- clockid_t clockid;
- seqcount_raw_spinlock_t seq;
- struct hrtimer *running;
- struct timerqueue_head active;
- ktime_t offset;
+ struct hrtimer_cpu_base *cpu_base;
+ const unsigned int index;
+ const clockid_t clockid;
+ seqcount_raw_spinlock_t seq;
+ ktime_t expires_next;
+ struct hrtimer *running;
+ struct timerqueue_linked_head active;
+ ktime_t offset;
} __hrtimer_clock_base_align;
-enum hrtimer_base_type {
+enum hrtimer_base_type {
HRTIMER_BASE_MONOTONIC,
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_BOOTTIME,
@@ -42,37 +44,36 @@ enum hrtimer_base_type {
HRTIMER_BASE_REALTIME_SOFT,
HRTIMER_BASE_BOOTTIME_SOFT,
HRTIMER_BASE_TAI_SOFT,
- HRTIMER_MAX_CLOCK_BASES,
+ HRTIMER_MAX_CLOCK_BASES
};
/**
* struct hrtimer_cpu_base - the per cpu clock bases
- * @lock: lock protecting the base and associated clock bases
- * and timers
- * @cpu: cpu number
- * @active_bases: Bitfield to mark bases with active timers
- * @clock_was_set_seq: Sequence counter of clock was set events
- * @hres_active: State of high resolution mode
- * @in_hrtirq: hrtimer_interrupt() is currently executing
- * @hang_detected: The last hrtimer interrupt detected a hang
- * @softirq_activated: displays, if the softirq is raised - update of softirq
- * related settings is not required then.
- * @nr_events: Total number of hrtimer interrupt events
- * @nr_retries: Total number of hrtimer interrupt retries
- * @nr_hangs: Total number of hrtimer interrupt hangs
- * @max_hang_time: Maximum time spent in hrtimer_interrupt
- * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
- * expired
- * @online: CPU is online from an hrtimers point of view
- * @timer_waiters: A hrtimer_cancel() invocation waits for the timer
- * callback to finish.
- * @expires_next: absolute time of the next event, is required for remote
- * hrtimer enqueue; it is the total first expiry time (hard
- * and soft hrtimer are taken into account)
- * @next_timer: Pointer to the first expiring timer
- * @softirq_expires_next: Time to check, if soft queues needs also to be expired
- * @softirq_next_timer: Pointer to the first expiring softirq based timer
- * @clock_base: array of clock bases for this cpu
+ * @lock: lock protecting the base and associated clock bases and timers
+ * @cpu: cpu number
+ * @active_bases: Bitfield to mark bases with active timers
+ * @clock_was_set_seq: Sequence counter of clock was set events
+ * @hres_active: State of high resolution mode
+ * @deferred_rearm: A deferred rearm is pending
+ * @deferred_needs_update: The deferred rearm must re-evaluate the first timer
+ * @hang_detected: The last hrtimer interrupt detected a hang
+ * @softirq_activated: Set if the softirq is raised - an update of softirq
+ * related settings is not required then.
+ * @nr_events: Total number of hrtimer interrupt events
+ * @nr_retries: Total number of hrtimer interrupt retries
+ * @nr_hangs: Total number of hrtimer interrupt hangs
+ * @max_hang_time: Maximum time spent in hrtimer_interrupt
+ * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are expired
+ * @online: CPU is online from an hrtimers point of view
+ * @timer_waiters: A hrtimer_cancel() invocation waits for the timer callback to finish.
+ * @expires_next: Absolute time of the next event, is required for remote
+ * hrtimer enqueue; it is the total first expiry time (hard
+ * and soft hrtimer are taken into account)
+ * @next_timer: Pointer to the first expiring timer
+ * @softirq_expires_next: Time to check whether the soft queues also need to be expired
+ * @softirq_next_timer: Pointer to the first expiring softirq based timer
+ * @deferred_expires_next: Cached expires next value for deferred rearm
+ * @clock_base: Array of clock bases for this cpu
*
* Note: next_timer is just an optimization for __remove_hrtimer().
* Do not dereference the pointer because it is not reliable on
@@ -83,11 +84,12 @@ struct hrtimer_cpu_base {
unsigned int cpu;
unsigned int active_bases;
unsigned int clock_was_set_seq;
- unsigned int hres_active : 1,
- in_hrtirq : 1,
- hang_detected : 1,
- softirq_activated : 1,
- online : 1;
+ bool hres_active;
+ bool deferred_rearm;
+ bool deferred_needs_update;
+ bool hang_detected;
+ bool softirq_activated;
+ bool online;
#ifdef CONFIG_HIGH_RES_TIMERS
unsigned int nr_events;
unsigned short nr_retries;
@@ -102,6 +104,7 @@ struct hrtimer_cpu_base {
struct hrtimer *next_timer;
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
+ ktime_t deferred_expires_next;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
call_single_data_t csd;
} ____cacheline_aligned;
diff --git a/include/linux/hrtimer_rearm.h b/include/linux/hrtimer_rearm.h
new file mode 100644
index 000000000000..a6f2e5d5e1c7
--- /dev/null
+++ b/include/linux/hrtimer_rearm.h
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_HRTIMER_REARM_H
+#define _LINUX_HRTIMER_REARM_H
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+#include <linux/thread_info.h>
+
+void __hrtimer_rearm_deferred(void);
+
+/*
+ * This is purely CPU local, so check the TIF bit first to avoid the overhead of
+ * the atomic test_and_clear_bit() operation for the common case where the bit
+ * is not set.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred_tif(unsigned long tif_work)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (unlikely(tif_work & _TIF_HRTIMER_REARM)) {
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ return true;
+ }
+ return false;
+}
+
+#define TIF_REARM_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_HRTIMER_REARM)
+
+/* Invoked from the exit-to-user path before invoking exit_to_user_mode_loop() */
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask)
+{
+ /* Help the compiler to optimize the function out for syscall returns */
+ if (!(tif_mask & _TIF_HRTIMER_REARM))
+ return false;
+ /*
+ * Rearm the timer if none of the resched flags is set before going into
+ * the loop which re-enables interrupts.
+ */
+ if (unlikely((*tif_work & TIF_REARM_MASK) == _TIF_HRTIMER_REARM)) {
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ __hrtimer_rearm_deferred();
+ /* Don't go into the loop if HRTIMER_REARM was the only flag */
+ *tif_work &= ~TIF_HRTIMER_REARM;
+ return !*tif_work;
+ }
+ return false;
+}
+
+/* Invoked from the time slice extension decision function */
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work)
+{
+ if (hrtimer_test_and_clear_rearm_deferred_tif(tif_work))
+ __hrtimer_rearm_deferred();
+}
+
+/*
+ * This is to be called on all irqentry_exit() paths that will enable
+ * interrupts.
+ */
+static __always_inline void hrtimer_rearm_deferred(void)
+{
+ hrtimer_rearm_deferred_tif(read_thread_flags());
+}
+
+/*
+ * Invoked from the scheduler on entry to __schedule() so it can defer
+ * rearming after the load balancing callbacks which might change hrtick.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void)
+{
+ return hrtimer_test_and_clear_rearm_deferred_tif(read_thread_flags());
+}
+
+#else /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#endif /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
+#endif
diff --git a/include/linux/hrtimer_types.h b/include/linux/hrtimer_types.h
index 8fbbb6bdf7a1..b5dacc8271a4 100644
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -17,7 +17,7 @@ enum hrtimer_restart {
/**
* struct hrtimer - the basic hrtimer structure
- * @node: timerqueue node, which also manages node.expires,
+ * @node: Linked timerqueue node, which also manages node.expires,
* the absolute expiry time in the hrtimers internal
* representation. The time is related to the clock on
* which the timer is based. Is setup by adding
@@ -28,23 +28,26 @@ enum hrtimer_restart {
* was armed.
* @function: timer expiry callback function
* @base: pointer to the timer base (per cpu and per clock)
- * @state: state information (See bit values above)
+ * @is_queued: Indicates whether a timer is enqueued or not
* @is_rel: Set if the timer was armed relative
* @is_soft: Set if hrtimer will be expired in soft interrupt context.
* @is_hard: Set if hrtimer will be expired in hard interrupt context
* even on RT.
+ * @is_lazy: Set if the timer is frequently rearmed to avoid updates
+ * of the clock event device
*
* The hrtimer structure must be initialized by hrtimer_setup()
*/
struct hrtimer {
- struct timerqueue_node node;
+ struct timerqueue_linked_node node;
+ struct hrtimer_clock_base *base;
+ bool is_queued;
+ bool is_rel;
+ bool is_soft;
+ bool is_hard;
+ bool is_lazy;
ktime_t _softexpires;
enum hrtimer_restart (*__private function)(struct hrtimer *);
- struct hrtimer_clock_base *base;
- u8 state;
- u8 is_rel;
- u8 is_soft;
- u8 is_hard;
};
#endif /* _LINUX_HRTIMER_TYPES_H */
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d26d1b1bcbfb..b976946b3cdb 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -3,6 +3,7 @@
#define __LINUX_IRQENTRYCOMMON_H
#include <linux/context_tracking.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/kmsan.h>
#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
@@ -33,6 +34,14 @@
_TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+# define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM)
+#else
+# define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK)
+#endif
+
/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
* @regs: Pointer to currents pt_regs
@@ -203,6 +212,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
/**
* __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
+ * @work_mask: Which TIF bits need to be evaluated
*
* 1) check that interrupts are disabled
* 2) call tick_nohz_user_enter_prepare()
@@ -212,7 +222,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
*
* Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs,
+ const unsigned long work_mask)
{
unsigned long ti_work;
@@ -222,8 +233,10 @@ static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
tick_nohz_user_enter_prepare();
ti_work = read_thread_flags();
- if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
- ti_work = exit_to_user_mode_loop(regs, ti_work);
+ if (unlikely(ti_work & work_mask)) {
+ if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask))
+ ti_work = exit_to_user_mode_loop(regs, ti_work);
+ }
arch_exit_to_user_mode_prepare(regs, ti_work);
}
@@ -239,7 +252,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
/* Temporary workaround to keep ARM64 alive */
static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
rseq_exit_to_user_mode_legacy();
__exit_to_user_mode_validate();
}
@@ -253,7 +266,7 @@ static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *reg
*/
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL);
rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -267,7 +280,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
*/
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ);
rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index d1c3d4941854..bbd57061802c 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -67,10 +67,6 @@ extern void register_refined_jiffies(long clock_tick_rate);
/* USER_TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
#define USER_TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
-#ifndef __jiffy_arch_data
-#define __jiffy_arch_data
-#endif
-
/*
* The 64-bit value is not atomic on 32-bit systems - you MUST NOT read it
* without sampling the sequence number in jiffies_lock.
@@ -83,7 +79,7 @@ extern void register_refined_jiffies(long clock_tick_rate);
* See arch/ARCH/kernel/vmlinux.lds.S
*/
extern u64 __cacheline_aligned_in_smp jiffies_64;
-extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
+extern unsigned long volatile __cacheline_aligned_in_smp jiffies;
#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void);
diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 4091e978aef2..48acdc3889dd 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -35,10 +35,15 @@
#define RB_CLEAR_NODE(node) \
((node)->__rb_parent_color = (unsigned long)(node))
+#define RB_EMPTY_LINKED_NODE(lnode) RB_EMPTY_NODE(&(lnode)->node)
+#define RB_CLEAR_LINKED_NODE(lnode) ({ \
+ RB_CLEAR_NODE(&(lnode)->node); \
+ (lnode)->prev = (lnode)->next = NULL; \
+})
extern void rb_insert_color(struct rb_node *, struct rb_root *);
extern void rb_erase(struct rb_node *, struct rb_root *);
-
+extern bool rb_erase_linked(struct rb_node_linked *, struct rb_root_linked *);
/* Find logical next and previous nodes in a tree */
extern struct rb_node *rb_next(const struct rb_node *);
@@ -213,15 +218,10 @@ rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
return leftmost ? node : NULL;
}
-/**
- * rb_add() - insert @node into @tree
- * @node: node to insert
- * @tree: tree to insert @node into
- * @less: operator defining the (partial) node order
- */
static __always_inline void
-rb_add(struct rb_node *node, struct rb_root *tree,
- bool (*less)(struct rb_node *, const struct rb_node *))
+__rb_add(struct rb_node *node, struct rb_root *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *),
+ void (*linkop)(struct rb_node *, struct rb_node *, struct rb_node **))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
@@ -234,10 +234,73 @@ rb_add(struct rb_node *node, struct rb_root *tree,
link = &parent->rb_right;
}
+ linkop(node, parent, link);
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
}
+#define __node_2_linked_node(_n) \
+ rb_entry((_n), struct rb_node_linked, node)
+
+static inline void
+rb_link_linked_node(struct rb_node *node, struct rb_node *parent, struct rb_node **link)
+{
+ if (!parent)
+ return;
+
+ struct rb_node_linked *nnew = __node_2_linked_node(node);
+ struct rb_node_linked *npar = __node_2_linked_node(parent);
+
+ if (link == &parent->rb_left) {
+ nnew->prev = npar->prev;
+ nnew->next = npar;
+ npar->prev = nnew;
+ if (nnew->prev)
+ nnew->prev->next = nnew;
+ } else {
+ nnew->next = npar->next;
+ nnew->prev = npar;
+ npar->next = nnew;
+ if (nnew->next)
+ nnew->next->prev = nnew;
+ }
+}
+
+/**
+ * rb_add_linked() - insert @node into the leftmost linked tree @tree
+ * @node: node to insert
+ * @tree: linked tree to insert @node into
+ * @less: operator defining the (partial) node order
+ *
+ * Returns %true when @node is the new leftmost, %false otherwise.
+ */
+static __always_inline bool
+rb_add_linked(struct rb_node_linked *node, struct rb_root_linked *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *))
+{
+ __rb_add(&node->node, &tree->rb_root, less, rb_link_linked_node);
+ if (!node->prev)
+ tree->rb_leftmost = node;
+ return !node->prev;
+}
+
+/* Empty linkop function which is optimized away by the compiler */
+static __always_inline void
+rb_link_noop(struct rb_node *n, struct rb_node *p, struct rb_node **l) { }
+
+/**
+ * rb_add() - insert @node into @tree
+ * @node: node to insert
+ * @tree: tree to insert @node into
+ * @less: operator defining the (partial) node order
+ */
+static __always_inline void
+rb_add(struct rb_node *node, struct rb_root *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *))
+{
+ __rb_add(node, tree, less, rb_link_noop);
+}
+
/**
* rb_find_add_cached() - find equivalent @node in @tree, or add @node
* @node: node to look-for / insert
diff --git a/include/linux/rbtree_types.h b/include/linux/rbtree_types.h
index 45b6ecde3665..3c7ae53e8139 100644
--- a/include/linux/rbtree_types.h
+++ b/include/linux/rbtree_types.h
@@ -9,6 +9,12 @@ struct rb_node {
} __attribute__((aligned(sizeof(long))));
/* The alignment might seem pointless, but allegedly CRIS needs it */
+struct rb_node_linked {
+ struct rb_node node;
+ struct rb_node_linked *prev;
+ struct rb_node_linked *next;
+};
+
struct rb_root {
struct rb_node *rb_node;
};
@@ -28,7 +34,17 @@ struct rb_root_cached {
struct rb_node *rb_leftmost;
};
+/*
+ * Leftmost tree with links. This would allow a trivial rb_rightmost update,
+ * but that has been omitted due to the lack of users.
+ */
+struct rb_root_linked {
+ struct rb_root rb_root;
+ struct rb_node_linked *rb_leftmost;
+};
+
#define RB_ROOT (struct rb_root) { NULL, }
#define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
+#define RB_ROOT_LINKED (struct rb_root_linked) { {NULL, }, NULL }
#endif
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index c6831c93cd6e..f11ebd34f8b9 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -40,6 +40,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/hrtimer_rearm.h>
#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/sched/signal.h>
@@ -110,7 +111,7 @@ static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
t->rseq.slice.state.granted = false;
}
-static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
{
struct task_struct *curr = current;
struct rseq_slice_ctrl usr_ctrl;
@@ -215,11 +216,20 @@ static __always_inline bool rseq_grant_slice_extension(bool work_pending)
return false;
}
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask)
+{
+ if (unlikely(__rseq_grant_slice_extension(ti_work & mask))) {
+ hrtimer_rearm_deferred_tif(ti_work);
+ return true;
+ }
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static __always_inline bool rseq_slice_extension_enabled(void) { return false; }
static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
-static __always_inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -778,7 +788,7 @@ static inline void rseq_syscall_exit_to_user_mode(void) { }
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index b8ae89ea28ab..e36d11e33e0c 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -72,6 +72,10 @@ struct tk_read_base {
* @id: The timekeeper ID
* @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
* @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
+ * @cs_id: The ID of the current clocksource
+ * @cs_ns_to_cyc_mult: Multiplier for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift: Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns: Maximum nanoseconds to cycles conversion range
* @clock_was_set_seq: The sequence number of clock was set events
* @cs_was_changed_seq: The sequence number of clocksource change events
* @clock_valid: Indicator for valid clock
@@ -159,6 +163,10 @@ struct timekeeper {
u64 raw_sec;
/* Cachline 3 and 4 (timekeeping internal variables): */
+ enum clocksource_ids cs_id;
+ u32 cs_ns_to_cyc_mult;
+ u32 cs_ns_to_cyc_shift;
+ u64 cs_ns_to_cyc_maxns;
unsigned int clock_was_set_seq;
u8 cs_was_changed_seq;
u8 clock_valid;
diff --git a/include/linux/timerqueue.h b/include/linux/timerqueue.h
index d306d9dd2207..7d0aaa766580 100644
--- a/include/linux/timerqueue.h
+++ b/include/linux/timerqueue.h
@@ -5,12 +5,11 @@
#include <linux/rbtree.h>
#include <linux/timerqueue_types.h>
-extern bool timerqueue_add(struct timerqueue_head *head,
- struct timerqueue_node *node);
-extern bool timerqueue_del(struct timerqueue_head *head,
- struct timerqueue_node *node);
-extern struct timerqueue_node *timerqueue_iterate_next(
- struct timerqueue_node *node);
+bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node);
+bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node);
+struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node);
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node);
/**
* timerqueue_getnext - Returns the timer with the earliest expiration time
@@ -19,8 +18,7 @@ extern struct timerqueue_node *timerqueue_iterate_next(
*
* Returns a pointer to the timer node that has the earliest expiration time.
*/
-static inline
-struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
+static inline struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
{
struct rb_node *leftmost = rb_first_cached(&head->rb_root);
@@ -41,4 +39,46 @@ static inline void timerqueue_init_head(struct timerqueue_head *head)
{
head->rb_root = RB_ROOT_CACHED;
}
+
+/* Timer queues with linked nodes */
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_first(struct timerqueue_linked_head *head)
+{
+ return rb_entry_safe(head->rb_root.rb_leftmost, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_next(struct timerqueue_linked_node *node)
+{
+ return rb_entry_safe(node->node.next, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_prev(struct timerqueue_linked_node *node)
+{
+ return rb_entry_safe(node->node.prev, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+bool timerqueue_linked_del(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+ return rb_erase_linked(&node->node, &head->rb_root);
+}
+
+static __always_inline void timerqueue_linked_init(struct timerqueue_linked_node *node)
+{
+ RB_CLEAR_LINKED_NODE(&node->node);
+}
+
+static __always_inline bool timerqueue_linked_node_queued(struct timerqueue_linked_node *node)
+{
+ return !RB_EMPTY_LINKED_NODE(&node->node);
+}
+
+static __always_inline void timerqueue_linked_init_head(struct timerqueue_linked_head *head)
+{
+ head->rb_root = RB_ROOT_LINKED;
+}
+
#endif /* _LINUX_TIMERQUEUE_H */
diff --git a/include/linux/timerqueue_types.h b/include/linux/timerqueue_types.h
index dc298d0923e3..be2218b147c4 100644
--- a/include/linux/timerqueue_types.h
+++ b/include/linux/timerqueue_types.h
@@ -6,12 +6,21 @@
#include <linux/types.h>
struct timerqueue_node {
- struct rb_node node;
- ktime_t expires;
+ struct rb_node node;
+ ktime_t expires;
};
struct timerqueue_head {
- struct rb_root_cached rb_root;
+ struct rb_root_cached rb_root;
+};
+
+struct timerqueue_linked_node {
+ struct rb_node_linked node;
+ ktime_t expires;
+};
+
+struct timerqueue_linked_head {
+ struct rb_root_linked rb_root;
};
#endif /* _LINUX_TIMERQUEUE_TYPES_H */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 37eb2f0f3dd8..40a43a4c7caf 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -22,20 +22,23 @@ union bpf_attr;
const char *trace_print_flags_seq(struct trace_seq *p, const char *delim,
unsigned long flags,
- const struct trace_print_flags *flag_array);
+ const struct trace_print_flags *flag_array,
+ size_t flag_array_size);
const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
- const struct trace_print_flags *symbol_array);
+ const struct trace_print_flags *symbol_array,
+ size_t symbol_array_size);
#if BITS_PER_LONG == 32
const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
unsigned long long flags,
- const struct trace_print_flags_u64 *flag_array);
+ const struct trace_print_flags_u64 *flag_array,
+ size_t flag_array_size);
const char *trace_print_symbols_seq_u64(struct trace_seq *p,
unsigned long long val,
- const struct trace_print_flags_u64
- *symbol_array);
+ const struct trace_print_flags_u64 *symbol_array,
+ size_t symbol_array_size);
#endif
struct trace_iterator;
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1641ae3e6ca0..07cbb9836b91 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -218,12 +218,13 @@ TRACE_EVENT(hrtimer_setup,
* hrtimer_start - called when the hrtimer is started
* @hrtimer: pointer to struct hrtimer
* @mode: the hrtimers mode
+ * @was_armed: Was armed when hrtimer_start*() was invoked
*/
TRACE_EVENT(hrtimer_start,
- TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode),
+ TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode, bool was_armed),
- TP_ARGS(hrtimer, mode),
+ TP_ARGS(hrtimer, mode, was_armed),
TP_STRUCT__entry(
__field( void *, hrtimer )
@@ -231,6 +232,7 @@ TRACE_EVENT(hrtimer_start,
__field( s64, expires )
__field( s64, softexpires )
__field( enum hrtimer_mode, mode )
+ __field( bool, was_armed )
),
TP_fast_assign(
@@ -239,26 +241,26 @@ TRACE_EVENT(hrtimer_start,
__entry->expires = hrtimer_get_expires(hrtimer);
__entry->softexpires = hrtimer_get_softexpires(hrtimer);
__entry->mode = mode;
+ __entry->was_armed = was_armed;
),
TP_printk("hrtimer=%p function=%ps expires=%llu softexpires=%llu "
- "mode=%s", __entry->hrtimer, __entry->function,
+ "mode=%s was_armed=%d", __entry->hrtimer, __entry->function,
(unsigned long long) __entry->expires,
(unsigned long long) __entry->softexpires,
- decode_hrtimer_mode(__entry->mode))
+ decode_hrtimer_mode(__entry->mode), __entry->was_armed)
);
/**
* hrtimer_expire_entry - called immediately before the hrtimer callback
* @hrtimer: pointer to struct hrtimer
- * @now: pointer to variable which contains current time of the
- * timers base.
+ * @now: Current time of the timer's base.
*
* Allows to determine the timer latency.
*/
TRACE_EVENT(hrtimer_expire_entry,
- TP_PROTO(struct hrtimer *hrtimer, ktime_t *now),
+ TP_PROTO(struct hrtimer *hrtimer, ktime_t now),
TP_ARGS(hrtimer, now),
@@ -270,7 +272,7 @@ TRACE_EVENT(hrtimer_expire_entry,
TP_fast_assign(
__entry->hrtimer = hrtimer;
- __entry->now = *now;
+ __entry->now = now;
__entry->function = ACCESS_PRIVATE(hrtimer, function);
),
@@ -321,6 +323,30 @@ DEFINE_EVENT(hrtimer_class, hrtimer_cancel,
TP_ARGS(hrtimer)
);
+/**
+ * hrtimer_rearm - Invoked when the clockevent device is rearmed
+ * @next_event: The next expiry time (CLOCK_MONOTONIC)
+ */
+TRACE_EVENT(hrtimer_rearm,
+
+ TP_PROTO(ktime_t next_event, bool deferred),
+
+ TP_ARGS(next_event, deferred),
+
+ TP_STRUCT__entry(
+ __field( s64, next_event )
+ __field( bool, deferred )
+ ),
+
+ TP_fast_assign(
+ __entry->next_event = next_event;
+ __entry->deferred = deferred;
+ ),
+
+ TP_printk("next_event=%llu deferred=%d",
+ (unsigned long long) __entry->next_event, __entry->deferred)
+);
+
/**
* itimer_state - called when itimer is started or canceled
* @which: name of the interval timer
diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
index fce85ea2df1c..b7d8ef4b9fe1 100644
--- a/include/trace/stages/stage3_trace_output.h
+++ b/include/trace/stages/stage3_trace_output.h
@@ -64,36 +64,36 @@
#define __get_rel_sockaddr(field) ((struct sockaddr *)__get_rel_dynamic_array(field))
#undef __print_flags
-#define __print_flags(flag, delim, flag_array...) \
- ({ \
- static const struct trace_print_flags __flags[] = \
- { flag_array, { -1, NULL }}; \
- trace_print_flags_seq(p, delim, flag, __flags); \
+#define __print_flags(flag, delim, flag_array...) \
+ ({ \
+ static const struct trace_print_flags __flags[] = \
+ { flag_array }; \
+ trace_print_flags_seq(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \
})
#undef __print_symbolic
-#define __print_symbolic(value, symbol_array...) \
- ({ \
- static const struct trace_print_flags symbols[] = \
- { symbol_array, { -1, NULL }}; \
- trace_print_symbols_seq(p, value, symbols); \
+#define __print_symbolic(value, symbol_array...) \
+ ({ \
+ static const struct trace_print_flags symbols[] = \
+ { symbol_array }; \
+ trace_print_symbols_seq(p, value, symbols, ARRAY_SIZE(symbols)); \
})
#undef __print_flags_u64
#undef __print_symbolic_u64
#if BITS_PER_LONG == 32
-#define __print_flags_u64(flag, delim, flag_array...) \
- ({ \
- static const struct trace_print_flags_u64 __flags[] = \
- { flag_array, { -1, NULL } }; \
- trace_print_flags_seq_u64(p, delim, flag, __flags); \
+#define __print_flags_u64(flag, delim, flag_array...) \
+ ({ \
+ static const struct trace_print_flags_u64 __flags[] = \
+ { flag_array }; \
+ trace_print_flags_seq_u64(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \
})
-#define __print_symbolic_u64(value, symbol_array...) \
- ({ \
- static const struct trace_print_flags_u64 symbols[] = \
- { symbol_array, { -1, NULL } }; \
- trace_print_symbols_seq_u64(p, value, symbols); \
+#define __print_symbolic_u64(value, symbol_array...) \
+ ({ \
+ static const struct trace_print_flags_u64 symbols[] = \
+ { symbol_array }; \
+ trace_print_symbols_seq_u64(p, value, symbols, ARRAY_SIZE(symbols)); \
})
#else
#define __print_flags_u64(flag, delim, flag_array...) \
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9ef63e414791..9e1a6afb07f2 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
local_irq_enable_exit_to_user(ti_work);
if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
- if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
schedule();
}
@@ -225,6 +225,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
*/
if (state.exit_rcu) {
instrumentation_begin();
+ hrtimer_rearm_deferred();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
@@ -238,6 +239,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ hrtimer_rearm_deferred();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..4495929f4c9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,7 +872,14 @@ void update_rq_clock(struct rq *rq)
* Use HR-timers to deliver accurate preemption points.
*/
-static void hrtick_clear(struct rq *rq)
+enum {
+ HRTICK_SCHED_NONE = 0,
+ HRTICK_SCHED_DEFER = BIT(1),
+ HRTICK_SCHED_START = BIT(2),
+ HRTICK_SCHED_REARM_HRTIMER = BIT(3)
+};
+
+static void __used hrtick_clear(struct rq *rq)
{
if (hrtimer_active(&rq->hrtick_timer))
hrtimer_cancel(&rq->hrtick_timer);
@@ -897,12 +904,24 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+ /*
+ * Queued is false when the timer is not started or currently
+ * running the callback. In both cases, restart. If queued, check
+ * whether the expiry time actually changes substantially.
+ */
+ return !hrtimer_is_queued(timer) ||
+ abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
{
struct hrtimer *timer = &rq->hrtick_timer;
ktime_t time = rq->hrtick_time;
- hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+ if (hrtick_needs_rearm(timer, time))
+ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
}
/*
@@ -914,7 +933,7 @@ static void __hrtick_start(void *arg)
struct rq_flags rf;
rq_lock(rq, &rf);
- __hrtick_restart(rq);
+ hrtick_cond_restart(rq);
rq_unlock(rq, &rf);
}
@@ -925,7 +944,6 @@ static void __hrtick_start(void *arg)
*/
void hrtick_start(struct rq *rq, u64 delay)
{
- struct hrtimer *timer = &rq->hrtick_timer;
s64 delta;
/*
@@ -933,27 +951,67 @@ void hrtick_start(struct rq *rq, u64 delay)
* doesn't make sense and can cause timer DoS.
*/
delta = max_t(s64, delay, 10000LL);
- rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+
+ /*
+ * If this is in the middle of schedule() only note the delay
+ * and let hrtick_schedule_exit() deal with it.
+ */
+ if (rq->hrtick_sched) {
+ rq->hrtick_sched |= HRTICK_SCHED_START;
+ rq->hrtick_delay = delta;
+ return;
+ }
+
+ rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+ if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+ return;
if (rq == this_rq())
- __hrtick_restart(rq);
+ hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
else
smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
}
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
{
- INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
- hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ rq->hrtick_sched = HRTICK_SCHED_DEFER;
+ if (hrtimer_test_and_clear_rearm_deferred())
+ rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
}
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+
+static inline void hrtick_schedule_exit(struct rq *rq)
{
+ if (rq->hrtick_sched & HRTICK_SCHED_START) {
+ rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+ hrtick_cond_restart(rq);
+ } else if (idle_rq(rq)) {
+ /*
+ * No need for using hrtimer_active(). The timer is CPU local
+ * and interrupts are disabled, so the callback cannot be
+ * running and the queued state is valid.
+ */
+ if (hrtimer_is_queued(&rq->hrtick_timer))
+ hrtimer_cancel(&rq->hrtick_timer);
+ }
+
+ if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+ __hrtimer_rearm_deferred();
+
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
}
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
{
+ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
+ hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
#endif /* !CONFIG_SCHED_HRTICK */
/*
@@ -5032,6 +5090,7 @@ static inline void finish_lock_switch(struct rq *rq)
*/
spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
@@ -6785,9 +6844,6 @@ static void __sched notrace __schedule(int sched_mode)
schedule_debug(prev, preempt);
- if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
- hrtick_clear(rq);
-
klp_sched_try_switch(prev);
local_irq_disable();
@@ -6814,6 +6870,8 @@ static void __sched notrace __schedule(int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();
+ hrtick_schedule_enter(rq);
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -6916,6 +6974,7 @@ static void __sched notrace __schedule(int sched_mode)
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
trace_sched_exit_tp(is_switch);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b00429323..9d619a4ec3d1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
act = ns_to_ktime(dl_next_period(dl_se));
}
- now = hrtimer_cb_get_time(timer);
+ now = ktime_get();
delta = ktime_to_ns(now) - rq_clock(rq);
act = ktime_add_ns(act, delta);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..2be80780ff51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5600,7 +5600,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* validating it and just reschedule.
*/
if (queued) {
- resched_curr_lazy(rq_of(cfs_rq));
+ resched_curr(rq_of(cfs_rq));
return;
}
#endif
@@ -6805,27 +6805,41 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ unsigned long scale = 1024;
+ unsigned long util = 0;
+ u64 vdelta;
+ u64 delta;
WARN_ON_ONCE(task_rq(p) != rq);
- if (rq->cfs.h_nr_queued > 1) {
- u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- u64 slice = se->slice;
- s64 delta = slice - ran;
+ if (rq->cfs.h_nr_queued <= 1)
+ return;
- if (delta < 0) {
- if (task_current_donor(rq, p))
- resched_curr(rq);
- return;
- }
- hrtick_start(rq, delta);
+ /*
+ * Compute time until virtual deadline
+ */
+ vdelta = se->deadline - se->vruntime;
+ if ((s64)vdelta < 0) {
+ if (task_current_donor(rq, p))
+ resched_curr(rq);
+ return;
}
+ delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+ /*
+ * Correct for instantaneous load of other classes.
+ */
+ util += cpu_util_irq(rq);
+ if (util && util < 1024) {
+ scale *= 1024;
+ scale /= (1024 - util);
+ }
+
+ hrtick_start(rq, (scale * delta) / 1024);
}
/*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
*/
static void hrtick_update(struct rq *rq)
{
@@ -6834,6 +6848,9 @@ static void hrtick_update(struct rq *rq)
if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
return;
+ if (hrtick_active(rq))
+ return;
+
hrtick_start_fair(rq, donor);
}
#else /* !CONFIG_SCHED_HRTICK: */
@@ -7156,9 +7173,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
WARN_ON_ONCE(!task_sleep);
WARN_ON_ONCE(p->on_rq != 1);
- /* Fix-up what dequeue_task_fair() skipped */
- hrtick_update(rq);
-
/*
* Fix-up what block_task() skipped.
*
@@ -7192,8 +7206,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/*
* Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
*/
-
- hrtick_update(rq);
return true;
}
@@ -13435,11 +13447,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
- if (queued) {
- if (!need_resched())
- hrtick_start_fair(rq, curr);
+ if (queued)
return;
- }
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a6584be79..d06228462607 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(HRTICK_DL, false)
+#endif
/*
* Decrement CPU capacity based on time not spent running tasks
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ef9ba480f51..a67c73ecdf79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1288,6 +1288,8 @@ struct rq {
call_single_data_t hrtick_csd;
struct hrtimer hrtick_timer;
ktime_t hrtick_time;
+ ktime_t hrtick_delay;
+ unsigned int hrtick_sched;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -3033,46 +3035,31 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
* - enabled by features
* - hrtimer is actually high res
*/
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
{
- if (!cpu_active(cpu_of(rq)))
- return 0;
- return hrtimer_is_hres_active(&rq->hrtick_timer);
+ return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
}
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
{
- if (!sched_feat(HRTICK))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK) && hrtick_enabled(rq);
}
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
{
- if (!sched_feat(HRTICK_DL))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
}
extern void hrtick_start(struct rq *rq, u64 delay);
-
-#else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_active(struct rq *rq)
{
- return 0;
+ return hrtimer_active(&rq->hrtick_timer);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
#endif /* !CONFIG_SCHED_HRTICK */
#ifndef arch_scale_freq_tick
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 77198911b8dd..4425d8dce44b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,6 +663,13 @@ void irq_enter_rcu(void)
{
__irq_enter_raw();
+ /*
+ * If this is a nested interrupt that hits the exit_to_user_mode_loop
+ * where it has enabled interrupts but before it has hit schedule() we
+ * could have hrtimers in an undefined state. Fix it up here.
+ */
+ hrtimer_rearm_deferred();
+
if (tick_nohz_full_cpu(smp_processor_id()) ||
(is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
tick_irq_enter();
@@ -719,8 +726,14 @@ static inline void __irq_exit_rcu(void)
#endif
account_hardirq_exit(current);
preempt_count_sub(HARDIRQ_OFFSET);
- if (!in_interrupt() && local_softirq_pending())
+ if (!in_interrupt() && local_softirq_pending()) {
+ /*
+ * If we left hrtimers unarmed, make sure to arm them now,
+ * before enabling interrupts to run SoftIRQ.
+ */
+ hrtimer_rearm_deferred();
invoke_softirq();
+ }
if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))
diff --git a/kernel/time/.kunitconfig b/kernel/time/.kunitconfig
new file mode 100644
index 000000000000..d60a611b2853
--- /dev/null
+++ b/kernel/time/.kunitconfig
@@ -0,0 +1,2 @@
+CONFIG_KUNIT=y
+CONFIG_TIME_KUNIT_TEST=y
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f7836c..6a11964377e6 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -17,6 +17,9 @@ config ARCH_CLOCKSOURCE_DATA
config ARCH_CLOCKSOURCE_INIT
bool
+config ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+ bool
+
# Timekeeping vsyscall support
config GENERIC_TIME_VSYSCALL
bool
@@ -44,10 +47,23 @@ config GENERIC_CLOCKEVENTS_BROADCAST_IDLE
config GENERIC_CLOCKEVENTS_MIN_ADJUST
bool
+config GENERIC_CLOCKEVENTS_COUPLED
+ bool
+
+config GENERIC_CLOCKEVENTS_COUPLED_INLINE
+ select GENERIC_CLOCKEVENTS_COUPLED
+ bool
+
# Generic update of CMOS clock
config GENERIC_CMOS_UPDATE
bool
+# Deferred rearming of the hrtimer interrupt
+config HRTIMER_REARM_DEFERRED
+ def_bool y
+ depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ depends on HIGH_RES_TIMERS && SCHED_HRTICK
+
# Select to handle posix CPU timers from task_work
# and not from the timer interrupt context
config HAVE_POSIX_CPU_TIMERS_TASK_WORK
@@ -196,18 +212,6 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.
-config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
- int "Clocksource watchdog maximum allowable skew (in microseconds)"
- depends on CLOCKSOURCE_WATCHDOG
- range 50 1000
- default 125
- help
- Specify the maximum amount of allowable watchdog skew in
- microseconds before reporting the clocksource to be unstable.
- The default is based on a half-second clocksource watchdog
- interval and NTP's maximum frequency drift of 500 parts
- per million. If the clocksource is good enough for NTP,
- it is good enough for the clocksource watchdog!
endif
config POSIX_AUX_CLOCKS
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index b64db405ba5c..6e173d70d825 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -234,19 +234,23 @@ static int alarmtimer_suspend(struct device *dev)
if (!rtc)
return 0;
- /* Find the soonest timer to expire*/
+ /* Find the soonest timer to expire */
for (i = 0; i < ALARM_NUMTYPE; i++) {
struct alarm_base *base = &alarm_bases[i];
struct timerqueue_node *next;
+ ktime_t next_expires;
ktime_t delta;
- scoped_guard(spinlock_irqsave, &base->lock)
+ scoped_guard(spinlock_irqsave, &base->lock) {
next = timerqueue_getnext(&base->timerqueue);
+ if (next)
+ next_expires = next->expires;
+ }
if (!next)
continue;
- delta = ktime_sub(next->expires, base->get_ktime());
+ delta = ktime_sub(next_expires, base->get_ktime());
if (!min || (delta < min)) {
- expires = next->expires;
+ expires = next_expires;
min = delta;
type = i;
}
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index eaae1ce9f060..b4d730604972 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -172,6 +172,7 @@ void clockevents_shutdown(struct clock_event_device *dev)
{
clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
}
/**
@@ -292,6 +293,38 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
#endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
+#include <asm/clock_inlined.h>
+#else
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *dev) { }
+#endif
+
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+ u64 cycles;
+
+ if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
+ return false;
+
+ if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
+ return false;
+
+ if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
+ arch_inlined_clockevent_set_next_coupled(cycles, dev);
+ else
+ dev->set_next_coupled(cycles, dev);
+ return true;
+}
+
+#else
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+ return false;
+}
+#endif
+
/**
* clockevents_program_event - Reprogram the clock event device.
* @dev: device to program
@@ -300,12 +333,10 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
*
* Returns 0 on success, -ETIME when the event is in the past.
*/
-int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
- bool force)
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
{
- unsigned long long clc;
int64_t delta;
- int rc;
+ u64 cycles;
if (WARN_ON_ONCE(expires < 0))
return -ETIME;
@@ -319,21 +350,35 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
clockevent_get_state(dev));
- /* Shortcut for clockevent devices that can deal with ktime. */
- if (dev->features & CLOCK_EVT_FEAT_KTIME)
+ /* ktime_t based reprogramming for the broadcast hrtimer device */
+ if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
return dev->set_next_ktime(expires, dev);
+ if (likely(clockevent_set_next_coupled(dev, expires)))
+ return 0;
+
delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
- if (delta <= 0)
- return force ? clockevents_program_min_delta(dev) : -ETIME;
- delta = min(delta, (int64_t) dev->max_delta_ns);
- delta = max(delta, (int64_t) dev->min_delta_ns);
+ /* Required for tick_periodic() during early boot */
+ if (delta <= 0 && !force)
+ return -ETIME;
+
+ if (delta > (int64_t)dev->min_delta_ns) {
+ delta = min(delta, (int64_t) dev->max_delta_ns);
+ cycles = ((u64)delta * dev->mult) >> dev->shift;
+ if (!dev->set_next_event((unsigned long) cycles, dev))
+ return 0;
+ }
- clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
- rc = dev->set_next_event((unsigned long) clc, dev);
+ if (dev->next_event_forced)
+ return 0;
- return (rc && force) ? clockevents_program_min_delta(dev) : rc;
+ if (dev->set_next_event(dev->min_delta_ticks, dev)) {
+ if (!force || clockevents_program_min_delta(dev))
+ return -ETIME;
+ }
+ dev->next_event_forced = 1;
+ return 0;
}
/*
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index 38dae590b29f..b4cf17b4aeed 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -3,202 +3,196 @@
* Unit test for the clocksource watchdog.
*
* Copyright (C) 2021 Facebook, Inc.
+ * Copyright (C) 2026 Intel Corp.
*
* Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
+ * Author: Thomas Gleixner <tglx@xxxxxxxxxx>
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/device.h>
#include <linux/clocksource.h>
-#include <linux/init.h>
+#include <linux/delay.h>
#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
#include <linux/kthread.h>
-#include <linux/delay.h>
-#include <linux/prandom.h>
-#include <linux/cpu.h>
#include "tick-internal.h"
+#include "timekeeping_internal.h"
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Clocksource watchdog unit test");
MODULE_AUTHOR("Paul E. McKenney <paulmck@xxxxxxxxxx>");
+MODULE_AUTHOR("Thomas Gleixner <tglx@xxxxxxxxxx>");
+
+enum wdtest_states {
+ WDTEST_INJECT_NONE,
+ WDTEST_INJECT_DELAY,
+ WDTEST_INJECT_POSITIVE,
+ WDTEST_INJECT_NEGATIVE,
+ WDTEST_INJECT_PERCPU = 0x100,
+};
-static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 10 : 0;
-module_param(holdoff, int, 0444);
-MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");
+static enum wdtest_states wdtest_state;
+static unsigned long wdtest_test_count;
+static ktime_t wdtest_last_ts, wdtest_offset;
-/* Watchdog kthread's task_struct pointer for debug purposes. */
-static struct task_struct *wdtest_task;
+#define SHIFT_4000PPM 8
-static u64 wdtest_jiffies_read(struct clocksource *cs)
+static ktime_t wdtest_get_offset(struct clocksource *cs)
{
- return (u64)jiffies;
-}
-
-static struct clocksource clocksource_wdtest_jiffies = {
- .name = "wdtest-jiffies",
- .rating = 1, /* lowest valid rating*/
- .uncertainty_margin = TICK_NSEC,
- .read = wdtest_jiffies_read,
- .mask = CLOCKSOURCE_MASK(32),
- .flags = CLOCK_SOURCE_MUST_VERIFY,
- .mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
- .shift = JIFFIES_SHIFT,
- .max_cycles = 10,
-};
+ if (wdtest_state < WDTEST_INJECT_PERCPU)
+ return wdtest_test_count & 0x1 ? 0 : wdtest_offset >> SHIFT_4000PPM;
-static int wdtest_ktime_read_ndelays;
-static bool wdtest_ktime_read_fuzz;
+ /* Only affect the readout of the "remote" CPU */
+ return cs->wd_cpu == smp_processor_id() ? 0 : NSEC_PER_MSEC;
+}
static u64 wdtest_ktime_read(struct clocksource *cs)
{
- int wkrn = READ_ONCE(wdtest_ktime_read_ndelays);
- static int sign = 1;
- u64 ret;
+ ktime_t now = ktime_get_raw_fast_ns();
+ ktime_t intv = now - wdtest_last_ts;
- if (wkrn) {
- udelay(cs->uncertainty_margin / 250);
- WRITE_ONCE(wdtest_ktime_read_ndelays, wkrn - 1);
- }
- ret = ktime_get_real_fast_ns();
- if (READ_ONCE(wdtest_ktime_read_fuzz)) {
- sign = -sign;
- ret = ret + sign * 100 * NSEC_PER_MSEC;
+ /*
+ * Only increment the test counter once per watchdog interval and
+ * store the interval for the offset calculation of this step. This
+ * guarantees a consistent behaviour even if the other side needs
+ * to repeat due to a watchdog read timeout.
+ */
+ if (intv > (NSEC_PER_SEC / 4)) {
+ WRITE_ONCE(wdtest_test_count, wdtest_test_count + 1);
+ wdtest_last_ts = now;
+ wdtest_offset = intv;
}
- return ret;
-}
-static void wdtest_ktime_cs_mark_unstable(struct clocksource *cs)
-{
- pr_info("--- Marking %s unstable due to clocksource watchdog.\n", cs->name);
+ switch (wdtest_state & ~WDTEST_INJECT_PERCPU) {
+ case WDTEST_INJECT_POSITIVE:
+ return now + wdtest_get_offset(cs);
+ case WDTEST_INJECT_NEGATIVE:
+ return now - wdtest_get_offset(cs);
+ case WDTEST_INJECT_DELAY:
+ udelay(500);
+ return now;
+ default:
+ return now;
+ }
}
-#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
- CLOCK_SOURCE_VALID_FOR_HRES | \
- CLOCK_SOURCE_MUST_VERIFY | \
- CLOCK_SOURCE_VERIFY_PERCPU)
+#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
+ CLOCK_SOURCE_CALIBRATED | \
+ CLOCK_SOURCE_MUST_VERIFY | \
+ CLOCK_SOURCE_WDTEST)
static struct clocksource clocksource_wdtest_ktime = {
.name = "wdtest-ktime",
- .rating = 300,
+ .rating = 10,
.read = wdtest_ktime_read,
.mask = CLOCKSOURCE_MASK(64),
.flags = KTIME_FLAGS,
- .mark_unstable = wdtest_ktime_cs_mark_unstable,
.list = LIST_HEAD_INIT(clocksource_wdtest_ktime.list),
};
-/* Reset the clocksource if needed. */
-static void wdtest_ktime_clocksource_reset(void)
+static void wdtest_clocksource_reset(enum wdtest_states which, bool percpu)
+{
+ clocksource_unregister(&clocksource_wdtest_ktime);
+
+ pr_info("Test: State %d percpu %d\n", which, percpu);
+
+ wdtest_state = which;
+ if (percpu)
+ wdtest_state |= WDTEST_INJECT_PERCPU;
+ wdtest_test_count = 0;
+ wdtest_last_ts = 0;
+
+ clocksource_wdtest_ktime.rating = 10;
+ clocksource_wdtest_ktime.flags = KTIME_FLAGS;
+ if (percpu)
+ clocksource_wdtest_ktime.flags |= CLOCK_SOURCE_WDTEST_PERCPU;
+ clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+}
+
+static bool wdtest_execute(enum wdtest_states which, bool percpu, unsigned int expect,
+ unsigned long calls)
{
- if (clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE) {
- clocksource_unregister(&clocksource_wdtest_ktime);
- clocksource_wdtest_ktime.flags = KTIME_FLAGS;
- schedule_timeout_uninterruptible(HZ / 10);
- clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+ wdtest_clocksource_reset(which, percpu);
+
+ for (; READ_ONCE(wdtest_test_count) < calls; msleep(100)) {
+ unsigned int flags = READ_ONCE(clocksource_wdtest_ktime.flags);
+
+ if (kthread_should_stop())
+ return false;
+
+ if (flags & CLOCK_SOURCE_UNSTABLE) {
+ if (expect & CLOCK_SOURCE_UNSTABLE)
+ return true;
+ pr_warn("Fail: Unexpected unstable\n");
+ return false;
+ }
+ if (flags & CLOCK_SOURCE_VALID_FOR_HRES) {
+ if (expect & CLOCK_SOURCE_VALID_FOR_HRES)
+ return true;
+ pr_warn("Fail: Unexpected valid for highres\n");
+ return false;
+ }
}
+
+ if (!expect)
+ return true;
+
+ pr_warn("Fail: Timed out\n");
+ return false;
}
-/* Run the specified series of watchdog tests. */
-static int wdtest_func(void *arg)
+static bool wdtest_run(bool percpu)
{
- unsigned long j1, j2;
- int i, max_retries;
- char *s;
+ if (!wdtest_execute(WDTEST_INJECT_NONE, percpu, CLOCK_SOURCE_VALID_FOR_HRES, 8))
+ return false;
- schedule_timeout_uninterruptible(holdoff * HZ);
+ if (!wdtest_execute(WDTEST_INJECT_DELAY, percpu, 0, 4))
+ return false;
- /*
- * Verify that jiffies-like clocksources get the manually
- * specified uncertainty margin.
- */
- pr_info("--- Verify jiffies-like uncertainty margin.\n");
- __clocksource_register(&clocksource_wdtest_jiffies);
- WARN_ON_ONCE(clocksource_wdtest_jiffies.uncertainty_margin != TICK_NSEC);
+ if (!wdtest_execute(WDTEST_INJECT_POSITIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+ return false;
- j1 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
- schedule_timeout_uninterruptible(HZ);
- j2 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
- WARN_ON_ONCE(j1 == j2);
+ if (!wdtest_execute(WDTEST_INJECT_NEGATIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+ return false;
- clocksource_unregister(&clocksource_wdtest_jiffies);
+ return true;
+}
- /*
- * Verify that tsc-like clocksources are assigned a reasonable
- * uncertainty margin.
- */
- pr_info("--- Verify tsc-like uncertainty margin.\n");
+static int wdtest_func(void *arg)
+{
clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
- WARN_ON_ONCE(clocksource_wdtest_ktime.uncertainty_margin < NSEC_PER_USEC);
-
- j1 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
- udelay(1);
- j2 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
- pr_info("--- tsc-like times: %lu - %lu = %lu.\n", j2, j1, j2 - j1);
- WARN_ONCE(time_before(j2, j1 + NSEC_PER_USEC),
- "Expected at least 1000ns, got %lu.\n", j2 - j1);
-
- /* Verify tsc-like stability with various numbers of errors injected. */
- max_retries = clocksource_get_max_watchdog_retry();
- for (i = 0; i <= max_retries + 1; i++) {
- if (i <= 1 && i < max_retries)
- s = "";
- else if (i <= max_retries)
- s = ", expect message";
- else
- s = ", expect clock skew";
- pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
- WRITE_ONCE(wdtest_ktime_read_ndelays, i);
- schedule_timeout_uninterruptible(2 * HZ);
- WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
- WARN_ON_ONCE((i <= max_retries) !=
- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
- wdtest_ktime_clocksource_reset();
+ if (wdtest_run(false)) {
+ if (wdtest_run(true))
+ pr_info("Success: All tests passed\n");
}
-
- /* Verify tsc-like stability with clock-value-fuzz error injection. */
- pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
- WRITE_ONCE(wdtest_ktime_read_fuzz, true);
- schedule_timeout_uninterruptible(2 * HZ);
- WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
- clocksource_verify_percpu(&clocksource_wdtest_ktime);
- WRITE_ONCE(wdtest_ktime_read_fuzz, false);
-
clocksource_unregister(&clocksource_wdtest_ktime);
- pr_info("--- Done with test.\n");
- return 0;
-}
+ if (!IS_MODULE(CONFIG_TEST_CLOCKSOURCE_WATCHDOG))
+ return 0;
-static void wdtest_print_module_parms(void)
-{
- pr_alert("--- holdoff=%d\n", holdoff);
+ while (!kthread_should_stop())
+ schedule_timeout_interruptible(3600 * HZ);
+ return 0;
}
-/* Cleanup function. */
-static void clocksource_wdtest_cleanup(void)
-{
-}
+static struct task_struct *wdtest_thread;
static int __init clocksource_wdtest_init(void)
{
- int ret = 0;
-
- wdtest_print_module_parms();
+ struct task_struct *t = kthread_run(wdtest_func, NULL, "wdtest");
- /* Create watchdog-test task. */
- wdtest_task = kthread_run(wdtest_func, NULL, "wdtest");
- if (IS_ERR(wdtest_task)) {
- ret = PTR_ERR(wdtest_task);
- pr_warn("%s: Failed to create wdtest kthread.\n", __func__);
- wdtest_task = NULL;
- return ret;
+ if (IS_ERR(t)) {
+ pr_warn("Failed to create wdtest kthread.\n");
+ return PTR_ERR(t);
}
-
+ wdtest_thread = t;
return 0;
}
-
module_init(clocksource_wdtest_init);
+
+static void clocksource_wdtest_cleanup(void)
+{
+ if (wdtest_thread)
+ kthread_stop(wdtest_thread);
+}
module_exit(clocksource_wdtest_cleanup);
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df7194961658..baee13a1f87f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,15 +7,17 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/device.h>
#include <linux/clocksource.h>
+#include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/device.h>
#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/module.h>
#include <linux/prandom.h>
-#include <linux/cpu.h>
+#include <linux/sched.h>
+#include <linux/tick.h>
+#include <linux/topology.h>
#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -107,48 +109,6 @@ static char override_name[CS_NAME_LEN];
static int finished_booting;
static u64 suspend_start;
-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))
-
-/*
- * Threshold: 0.0312s, when doubled: 0.0625s.
- */
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 5)
-
-/*
- * Maximum permissible delay between two readouts of the watchdog
- * clocksource surrounding a read of the clocksource being validated.
- * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
- * a lower bound for cs->uncertainty_margin values when registering clocks.
- *
- * The default of 500 parts per million is based on NTP's limits.
- * If a clocksource is good enough for NTP, it is good enough for us!
- *
- * In other words, by default, even if a clocksource is extremely
- * precise (for example, with a sub-nanosecond period), the maximum
- * permissible skew between the clocksource watchdog and the clocksource
- * under test is not permitted to go below the 500ppm minimum defined
- * by MAX_SKEW_USEC. This 500ppm minimum may be overridden using the
- * CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
- */
-#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#else
-#define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ)
-#endif
-
-/*
- * Default for maximum permissible skew when cs->uncertainty_margin is
- * not specified, and the lower bound even when cs->uncertainty_margin
- * is specified. This is also the default that is used when registering
- * clocks with unspecified cs->uncertainty_margin, so this macro is used
- * even in CONFIG_CLOCKSOURCE_WATCHDOG=n kernels.
- */
-#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
-
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
static void clocksource_select(void);
@@ -160,7 +120,42 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static int watchdog_running;
static atomic_t watchdog_reset_pending;
-static int64_t watchdog_max_interval;
+
+/* Watchdog interval: 0.5sec. */
+#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTERVAL_NS (WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
+
+/* Maximum time between two reference watchdog readouts */
+#define WATCHDOG_READOUT_MAX_NS (50U * NSEC_PER_USEC)
+
+/*
+ * Maximum time between two remote readouts for NUMA=n. On NUMA-enabled systems
+ * the timeout is calculated from the NUMA distance.
+ */
+#define WATCHDOG_DEFAULT_TIMEOUT_NS (50U * NSEC_PER_USEC)
+
+/*
+ * Remote timeout NUMA distance multiplier. The local distance is 10. The
+ * default remote distance is 20. ACPI tables provide more accurate numbers
+ * which are guaranteed to be greater than the local distance.
+ *
+ * This results in a 5us base value, which is equivalent to the above !NUMA
+ * default.
+ */
+#define WATCHDOG_NUMA_MULTIPLIER_NS ((u64)(WATCHDOG_DEFAULT_TIMEOUT_NS / LOCAL_DISTANCE))
+
+/* Limit the NUMA timeout in case the distance values are insanely big */
+#define WATCHDOG_NUMA_MAX_TIMEOUT_NS ((u64)(500U * NSEC_PER_USEC))
+
+/* Shift values to calculate the approximate N ppm of a given delta. */
+#define SHIFT_500PPM 11
+#define SHIFT_4000PPM 8
+
+/* Number of attempts to read the watchdog */
+#define WATCHDOG_FREQ_RETRIES 3
+
+/* Five reads each, local and remote, for inter-CPU skew detection */
+#define WATCHDOG_REMOTE_MAX_SEQ 10
static inline void clocksource_watchdog_lock(unsigned long *flags)
{
@@ -241,204 +236,422 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}
-static int verify_n_cpus = 8;
-module_param(verify_n_cpus, int, 0644);
+static inline void clocksource_reset_watchdog(void)
+{
+ struct clocksource *cs;
-enum wd_read_status {
- WD_READ_SUCCESS,
- WD_READ_UNSTABLE,
- WD_READ_SKIP
+ list_for_each_entry(cs, &watchdog_list, wd_list)
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+enum wd_result {
+ WD_SUCCESS,
+ WD_FREQ_NO_WATCHDOG,
+ WD_FREQ_TIMEOUT,
+ WD_FREQ_RESET,
+ WD_FREQ_SKEWED,
+ WD_CPU_TIMEOUT,
+ WD_CPU_SKEWED,
+};
+
+struct watchdog_cpu_data {
+ /* Keep first as it is 32-byte aligned */
+ call_single_data_t csd;
+ atomic_t remote_inprogress;
+ enum wd_result result;
+ u64 cpu_ts[2];
+ struct clocksource *cs;
+ /* Ensure that the sequence is in a separate cache line */
+ atomic_t seq ____cacheline_aligned;
+ /* Set by the control CPU according to NUMA distance */
+ u64 timeout_ns;
};
-static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
-{
- int64_t md = watchdog->uncertainty_margin;
- unsigned int nretries, max_retries;
- int64_t wd_delay, wd_seq_delay;
- u64 wd_end, wd_end2;
-
- max_retries = clocksource_get_max_watchdog_retry();
- for (nretries = 0; nretries <= max_retries; nretries++) {
- local_irq_disable();
- *wdnow = watchdog->read(watchdog);
- *csnow = cs->read(cs);
- wd_end = watchdog->read(watchdog);
- wd_end2 = watchdog->read(watchdog);
- local_irq_enable();
-
- wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
- if (wd_delay <= md + cs->uncertainty_margin) {
- if (nretries > 1 && nretries >= max_retries) {
- pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
- smp_processor_id(), watchdog->name, nretries);
+struct watchdog_data {
+ raw_spinlock_t lock;
+ enum wd_result result;
+
+ u64 wd_seq;
+ u64 wd_delta;
+ u64 cs_delta;
+ u64 cpu_ts[2];
+
+ unsigned int curr_cpu;
+} ____cacheline_aligned_in_smp;
+
+static void watchdog_check_skew_remote(void *unused);
+
+static DEFINE_PER_CPU_ALIGNED(struct watchdog_cpu_data, watchdog_cpu_data) = {
+ .csd = CSD_INIT(watchdog_check_skew_remote, NULL),
+};
+
+static struct watchdog_data watchdog_data = {
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(watchdog_data.lock),
+};
+
+static inline void watchdog_set_result(struct watchdog_cpu_data *wd, enum wd_result result)
+{
+ guard(raw_spinlock)(&watchdog_data.lock);
+ if (!wd->result) {
+ atomic_set(&wd->seq, WATCHDOG_REMOTE_MAX_SEQ);
+ WRITE_ONCE(wd->result, result);
+ }
+}
+
+/* Wait for the sequence number to hand over control. */
+static bool watchdog_wait_seq(struct watchdog_cpu_data *wd, u64 start, int seq)
+{
+ for (int cnt = 0; atomic_read(&wd->seq) < seq; cnt++) {
+ /* Bail if the other side set an error result */
+ if (READ_ONCE(wd->result) != WD_SUCCESS)
+ return false;
+
+ /* Prevent endless loops if the other CPU does not react. */
+ if (cnt == 5000) {
+ u64 nsecs = ktime_get_raw_fast_ns();
+
+ if (nsecs - start >= wd->timeout_ns) {
+ watchdog_set_result(wd, WD_CPU_TIMEOUT);
+ return false;
}
- return WD_READ_SUCCESS;
+ cnt = 0;
}
+ cpu_relax();
+ }
+ return seq < WATCHDOG_REMOTE_MAX_SEQ;
+}
- /*
- * Now compute delay in consecutive watchdog read to see if
- * there is too much external interferences that cause
- * significant delay in reading both clocksource and watchdog.
- *
- * If consecutive WD read-back delay > md, report
- * system busy, reinit the watchdog and skip the current
- * watchdog test.
- */
- wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
- if (wd_seq_delay > md)
- goto skip_test;
+static void watchdog_check_skew(struct watchdog_cpu_data *wd, int index)
+{
+ u64 prev, now, delta, start = ktime_get_raw_fast_ns();
+ int local = index, remote = (index + 1) & 0x1;
+ struct clocksource *cs = wd->cs;
+
+ /* Set the local timestamp so that the first iteration works correctly */
+ wd->cpu_ts[local] = cs->read(cs);
+
+ /* Signal arrival */
+ atomic_inc(&wd->seq);
+
+ for (int seq = local + 2; seq < WATCHDOG_REMOTE_MAX_SEQ; seq += 2) {
+ if (!watchdog_wait_seq(wd, start, seq))
+ return;
+
+ /* Capture local timestamp before possible non-local coherency overhead */
+ now = cs->read(cs);
+
+ /* Store local timestamp before reading remote to limit coherency stalls */
+ wd->cpu_ts[local] = now;
+
+ prev = wd->cpu_ts[remote];
+ delta = (now - prev) & cs->mask;
+
+ if (delta > cs->max_raw_delta) {
+ watchdog_set_result(wd, WD_CPU_SKEWED);
+ return;
+ }
+
+ /* Hand over to the remote CPU */
+ atomic_inc(&wd->seq);
}
+}
- pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
- smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
- return WD_READ_UNSTABLE;
+static void watchdog_check_skew_remote(void *unused)
+{
+ struct watchdog_cpu_data *wd = this_cpu_ptr(&watchdog_cpu_data);
-skip_test:
- pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
- smp_processor_id(), watchdog->name, wd_seq_delay);
- pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
- cs->name, wd_delay);
- return WD_READ_SKIP;
+ atomic_inc(&wd->remote_inprogress);
+ watchdog_check_skew(wd, 1);
+ atomic_dec(&wd->remote_inprogress);
}
-static u64 csnow_mid;
-static cpumask_t cpus_ahead;
-static cpumask_t cpus_behind;
-static cpumask_t cpus_chosen;
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
+{
+ return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
+
+/*
+ * This is only invoked for remote CPUs. See watchdog_check_cpu_skew().
+ */
+static inline u64 wd_get_remote_timeout(unsigned int remote_cpu)
+{
+ unsigned int n1, n2;
+ u64 ns;
+
+ if (nr_node_ids == 1)
+ return WATCHDOG_DEFAULT_TIMEOUT_NS;
+
+ n1 = cpu_to_node(smp_processor_id());
+ n2 = cpu_to_node(remote_cpu);
+ ns = WATCHDOG_NUMA_MULTIPLIER_NS * node_distance(n1, n2);
+ return min(ns, WATCHDOG_NUMA_MAX_TIMEOUT_NS);
+}
-static void clocksource_verify_choose_cpus(void)
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
{
- int cpu, i, n = verify_n_cpus;
+ struct watchdog_cpu_data *wd;
- if (n < 0 || n >= num_online_cpus()) {
- /* Check all of the CPUs. */
- cpumask_copy(&cpus_chosen, cpu_online_mask);
- cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+ wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
+ if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
+ watchdog_data.result = WD_CPU_TIMEOUT;
return;
}
- /* If no checking desired, or no other CPU to check, leave. */
- cpumask_clear(&cpus_chosen);
- if (n == 0 || num_online_cpus() <= 1)
+ atomic_set(&wd->seq, 0);
+ wd->result = WD_SUCCESS;
+ wd->cs = cs;
+ /* Store the current CPU ID for the watchdog test unit */
+ cs->wd_cpu = smp_processor_id();
+
+ wd->timeout_ns = wd_get_remote_timeout(cpu);
+
+ /* Kick the remote CPU into the watchdog function */
+ if (WARN_ON_ONCE(smp_call_function_single_async(cpu, &wd->csd))) {
+ watchdog_data.result = WD_CPU_TIMEOUT;
+ return;
+ }
+
+ scoped_guard(irq)
+ watchdog_check_skew(wd, 0);
+
+ scoped_guard(raw_spinlock_irq, &watchdog_data.lock) {
+ watchdog_data.result = wd->result;
+ memcpy(watchdog_data.cpu_ts, wd->cpu_ts, sizeof(wd->cpu_ts));
+ }
+}
+
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+ unsigned int cpu = watchdog_data.curr_cpu;
+
+ cpu = cpumask_next_wrap(cpu, cpu_online_mask);
+ watchdog_data.curr_cpu = cpu;
+
+ /* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+ if (cpu == smp_processor_id())
return;
- /* Make sure to select at least one CPU other than the current CPU. */
- cpu = cpumask_any_but(cpu_online_mask, smp_processor_id());
- if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
+ /* Don't interfere with the test mechanics */
+ if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
return;
- cpumask_set_cpu(cpu, &cpus_chosen);
- /* Force a sane value for the boot parameter. */
- if (n > nr_cpu_ids)
- n = nr_cpu_ids;
+ __watchdog_check_cpu_skew(cs, cpu);
+}
+
+static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
+{
+ unsigned int ppm_shift = SHIFT_4000PPM;
+ u64 wd_ts0, wd_ts1, cs_ts;
+
+ watchdog_data.result = WD_SUCCESS;
+ if (!watchdog) {
+ watchdog_data.result = WD_FREQ_NO_WATCHDOG;
+ return false;
+ }
+
+ if (cs->flags & CLOCK_SOURCE_WDTEST_PERCPU)
+ return true;
/*
- * Randomly select the specified number of CPUs. If the same
- * CPU is selected multiple times, that CPU is checked only once,
- * and no replacement CPU is selected. This gracefully handles
- * situations where verify_n_cpus is greater than the number of
- * CPUs that are currently online.
+ * If both the clocksource and the watchdog claim they are
+ * calibrated, use the 500ppm limit. Uncalibrated clocksources need a
+ * larger allowance because the firmware-supplied frequencies can be
+ * way off.
*/
- for (i = 1; i < n; i++) {
- cpu = cpumask_random(cpu_online_mask);
- if (!WARN_ON_ONCE(cpu >= nr_cpu_ids))
- cpumask_set_cpu(cpu, &cpus_chosen);
+ if (watchdog->flags & CLOCK_SOURCE_CALIBRATED && cs->flags & CLOCK_SOURCE_CALIBRATED)
+ ppm_shift = SHIFT_500PPM;
+
+ for (int retries = 0; retries < WATCHDOG_FREQ_RETRIES; retries++) {
+ s64 wd_last, cs_last, wd_seq, wd_delta, cs_delta, max_delta;
+
+ scoped_guard(irq) {
+ wd_ts0 = watchdog->read(watchdog);
+ cs_ts = cs->read(cs);
+ wd_ts1 = watchdog->read(watchdog);
+ }
+
+ wd_last = cs->wd_last;
+ cs_last = cs->cs_last;
+
+ /* Validate the watchdog readout window */
+ wd_seq = cycles_to_nsec_safe(watchdog, wd_ts0, wd_ts1);
+ if (wd_seq > WATCHDOG_READOUT_MAX_NS) {
+ /* Store for printout in case all retries fail */
+ watchdog_data.wd_seq = wd_seq;
+ continue;
+ }
+
+ /* Store for subsequent processing */
+ cs->wd_last = wd_ts0;
+ cs->cs_last = cs_ts;
+
+ /* First round or reset pending? */
+ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || reset_pending)
+ goto reset;
+
+ /* Calculate the nanosecond deltas from the last invocation */
+ wd_delta = cycles_to_nsec_safe(watchdog, wd_last, wd_ts0);
+ cs_delta = cycles_to_nsec_safe(cs, cs_last, cs_ts);
+
+ watchdog_data.wd_delta = wd_delta;
+ watchdog_data.cs_delta = cs_delta;
+
+ /*
+ * Ensure that the deltas are within the readout limits of
+ * the clocksource and the watchdog. Long delays can cause
+ * clocksources to overflow.
+ */
+ max_delta = max(wd_delta, cs_delta);
+ if (max_delta > cs->max_idle_ns || max_delta > watchdog->max_idle_ns)
+ goto reset;
+
+ /*
+ * Calculate and validate the skew against the allowed PPM
+ * value of the maximum delta plus the watchdog readout
+ * time.
+ */
+ if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
+ return true;
+
+ watchdog_data.result = WD_FREQ_SKEWED;
+ return false;
}
- /* Don't verify ourselves. */
- cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+ watchdog_data.result = WD_FREQ_TIMEOUT;
+ return false;
+
+reset:
+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
+ watchdog_data.result = WD_FREQ_RESET;
+ return false;
}
-static void clocksource_verify_one_cpu(void *csin)
+/* Synchronization for sched clock */
+static void clocksource_tick_stable(struct clocksource *cs)
{
- struct clocksource *cs = (struct clocksource *)csin;
-
- csnow_mid = cs->read(cs);
+ if (cs == curr_clocksource && cs->tick_stable)
+ cs->tick_stable(cs);
}
-void clocksource_verify_percpu(struct clocksource *cs)
+/* Conditionally enable high resolution mode */
+static void clocksource_enable_highres(struct clocksource *cs)
{
- int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
- u64 csnow_begin, csnow_end;
- int cpu, testcpu;
- s64 delta;
+ if ((cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) ||
+ !(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) ||
+ !watchdog || !(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS))
+ return;
+
+ /* Mark it valid for high-res. */
+ cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
- if (verify_n_cpus == 0)
+ /*
+ * Can't schedule work before finished_booting is true.
+ * clocksource_done_booting() will take care of it.
+ */
+ if (!finished_booting)
return;
- cpumask_clear(&cpus_ahead);
- cpumask_clear(&cpus_behind);
- cpus_read_lock();
- migrate_disable();
- clocksource_verify_choose_cpus();
- if (cpumask_empty(&cpus_chosen)) {
- migrate_enable();
- cpus_read_unlock();
- pr_warn("Not enough CPUs to check clocksource '%s'.\n", cs->name);
+
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
return;
+
+ /*
+ * If this is not the current clocksource let the watchdog thread
+ * reselect it. Due to the change to high res this clocksource
+ * might be preferred now. If it is the current clocksource let the
+ * tick code know about that change.
+ */
+ if (cs != curr_clocksource) {
+ cs->flags |= CLOCK_SOURCE_RESELECT;
+ schedule_work(&watchdog_work);
+ } else {
+ tick_clock_notify();
}
- testcpu = smp_processor_id();
- pr_info("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n",
- cs->name, testcpu, cpumask_pr_args(&cpus_chosen));
- preempt_disable();
- for_each_cpu(cpu, &cpus_chosen) {
- if (cpu == testcpu)
- continue;
- csnow_begin = cs->read(cs);
- smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
- delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
- if (delta < 0)
- cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - csnow_mid) & cs->mask;
- if (delta < 0)
- cpumask_set_cpu(cpu, &cpus_ahead);
- cs_nsec = cycles_to_nsec_safe(cs, csnow_begin, csnow_end);
- if (cs_nsec > cs_nsec_max)
- cs_nsec_max = cs_nsec;
- if (cs_nsec < cs_nsec_min)
- cs_nsec_min = cs_nsec;
+}
+
+static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 2);
+
+static void watchdog_print_freq_timeout(struct clocksource *cs)
+{
+ if (!__ratelimit(&ratelimit_state))
+ return;
+ pr_info("Watchdog %s read timed out. Readout sequence took: %lluns\n",
+ watchdog->name, watchdog_data.wd_seq);
+}
+
+static void watchdog_print_freq_skew(struct clocksource *cs)
+{
+ pr_warn("Marking clocksource %s unstable due to frequency skew\n", cs->name);
+ pr_warn("Watchdog %20s interval: %16lluns\n", watchdog->name, watchdog_data.wd_delta);
+ pr_warn("Clocksource %20s interval: %16lluns\n", cs->name, watchdog_data.cs_delta);
+}
+
+static void watchdog_handle_remote_timeout(struct clocksource *cs)
+{
+ pr_info_once("Watchdog remote CPU %u read timed out\n", watchdog_data.curr_cpu);
+}
+
+static void watchdog_print_remote_skew(struct clocksource *cs)
+{
+ pr_warn("Marking clocksource %s unstable due to inter CPU skew\n", cs->name);
+ if (watchdog_data.cpu_ts[0] < watchdog_data.cpu_ts[1]) {
+ pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", smp_processor_id(),
+ watchdog_data.cpu_ts[0], watchdog_data.curr_cpu, watchdog_data.cpu_ts[1]);
+ } else {
+ pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", watchdog_data.curr_cpu,
+ watchdog_data.cpu_ts[1], smp_processor_id(), watchdog_data.cpu_ts[0]);
}
- preempt_enable();
- migrate_enable();
- cpus_read_unlock();
- if (!cpumask_empty(&cpus_ahead))
- pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
- cpumask_pr_args(&cpus_ahead), testcpu, cs->name);
- if (!cpumask_empty(&cpus_behind))
- pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
- cpumask_pr_args(&cpus_behind), testcpu, cs->name);
- pr_info(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
- testcpu, cs_nsec_min, cs_nsec_max, cs->name);
-}
-EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
+}
-static inline void clocksource_reset_watchdog(void)
+static void watchdog_check_result(struct clocksource *cs)
{
- struct clocksource *cs;
+ switch (watchdog_data.result) {
+ case WD_SUCCESS:
+ clocksource_tick_stable(cs);
+ clocksource_enable_highres(cs);
+ return;
- list_for_each_entry(cs, &watchdog_list, wd_list)
+ case WD_FREQ_TIMEOUT:
+ watchdog_print_freq_timeout(cs);
+ /* Try again later and invalidate the reference timestamps. */
cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
+ return;
+ case WD_FREQ_NO_WATCHDOG:
+ case WD_FREQ_RESET:
+ /*
+ * Nothing to do when the reference timestamps were reset
+ * or no watchdog clocksource registered.
+ */
+ return;
+
+ case WD_FREQ_SKEWED:
+ watchdog_print_freq_skew(cs);
+ break;
+
+ case WD_CPU_TIMEOUT:
+ /* Remote check timed out. Try again next cycle. */
+ watchdog_handle_remote_timeout(cs);
+ return;
+
+ case WD_CPU_SKEWED:
+ watchdog_print_remote_skew(cs);
+ break;
+ }
+ __clocksource_unstable(cs);
+}
static void clocksource_watchdog(struct timer_list *unused)
{
- int64_t wd_nsec, cs_nsec, interval;
- u64 csnow, wdnow, cslast, wdlast;
- int next_cpu, reset_pending;
struct clocksource *cs;
- enum wd_read_status read_ret;
- unsigned long extra_wait = 0;
- u32 md;
+ bool reset_pending;
- spin_lock(&watchdog_lock);
+ guard(spinlock)(&watchdog_lock);
if (!watchdog_running)
- goto out;
+ return;
reset_pending = atomic_read(&watchdog_reset_pending);
list_for_each_entry(cs, &watchdog_list, wd_list) {
-
/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
if (finished_booting)
@@ -446,170 +659,40 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
-
- if (read_ret == WD_READ_UNSTABLE) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
- continue;
- }
-
- /*
- * When WD_READ_SKIP is returned, it means the system is likely
- * under very heavy load, where the latency of reading
- * watchdog/clocksource is very big, and affect the accuracy of
- * watchdog check. So give system some space and suspend the
- * watchdog check for 5 minutes.
- */
- if (read_ret == WD_READ_SKIP) {
- /*
- * As the watchdog timer will be suspended, and
- * cs->last could keep unchanged for 5 minutes, reset
- * the counters.
- */
- clocksource_reset_watchdog();
- extra_wait = HZ * 300;
- break;
- }
-
- /* Clocksource initialized ? */
- if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
- atomic_read(&watchdog_reset_pending)) {
- cs->flags |= CLOCK_SOURCE_WATCHDOG;
- cs->wd_last = wdnow;
- cs->cs_last = csnow;
- continue;
+ /* Compare against watchdog clocksource if available */
+ if (watchdog_check_freq(cs, reset_pending)) {
+ /* Check for inter CPU skew */
+ watchdog_check_cpu_skew(cs);
}
- wd_nsec = cycles_to_nsec_safe(watchdog, cs->wd_last, wdnow);
- cs_nsec = cycles_to_nsec_safe(cs, cs->cs_last, csnow);
- wdlast = cs->wd_last; /* save these in case we print them */
- cslast = cs->cs_last;
- cs->cs_last = csnow;
- cs->wd_last = wdnow;
-
- if (atomic_read(&watchdog_reset_pending))
- continue;
-
- /*
- * The processing of timer softirqs can get delayed (usually
- * on account of ksoftirqd not getting to run in a timely
- * manner), which causes the watchdog interval to stretch.
- * Skew detection may fail for longer watchdog intervals
- * on account of fixed margins being used.
- * Some clocksources, e.g. acpi_pm, cannot tolerate
- * watchdog intervals longer than a few seconds.
- */
- interval = max(cs_nsec, wd_nsec);
- if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS)) {
- if (system_state > SYSTEM_SCHEDULING &&
- interval > 2 * watchdog_max_interval) {
- watchdog_max_interval = interval;
- pr_warn("Long readout interval, skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
- cs_nsec, wd_nsec);
- }
- watchdog_timer.expires = jiffies;
- continue;
- }
-
- /* Check the deviation from the watchdog clocksource. */
- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
- if (abs(cs_nsec - wd_nsec) > md) {
- s64 cs_wd_msec;
- s64 wd_msec;
- u32 wd_rem;
-
- pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
- smp_processor_id(), cs->name);
- pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
- watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
- pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
- cs->name, cs_nsec, csnow, cslast, cs->mask);
- cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
- wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
- pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
- cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
- if (curr_clocksource == cs)
- pr_warn(" '%s' is current clocksource.\n", cs->name);
- else if (curr_clocksource)
- pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
- else
- pr_warn(" No current clocksource.\n");
- __clocksource_unstable(cs);
- continue;
- }
-
- if (cs == curr_clocksource && cs->tick_stable)
- cs->tick_stable(cs);
-
- if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
- (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
- (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
- /* Mark it valid for high-res. */
- cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
-
- /*
- * clocksource_done_booting() will sort it if
- * finished_booting is not set yet.
- */
- if (!finished_booting)
- continue;
-
- /*
- * If this is not the current clocksource let
- * the watchdog thread reselect it. Due to the
- * change to high res this clocksource might
- * be preferred now. If it is the current
- * clocksource let the tick code know about
- * that change.
- */
- if (cs != curr_clocksource) {
- cs->flags |= CLOCK_SOURCE_RESELECT;
- schedule_work(&watchdog_work);
- } else {
- tick_clock_notify();
- }
- }
+ watchdog_check_result(cs);
}
- /*
- * We only clear the watchdog_reset_pending, when we did a
- * full cycle through all clocksources.
- */
+ /* Clear after the full clocksource walk */
if (reset_pending)
atomic_dec(&watchdog_reset_pending);
- /*
- * Cycle through CPUs to check if the CPUs stay synchronized
- * to each other.
- */
- next_cpu = cpumask_next_wrap(raw_smp_processor_id(), cpu_online_mask);
-
- /*
- * Arm timer if not already pending: could race with concurrent
- * pair clocksource_stop_watchdog() clocksource_start_watchdog().
- */
+ /* Could have been rearmed by a stop/start cycle */
if (!timer_pending(&watchdog_timer)) {
- watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
- add_timer_on(&watchdog_timer, next_cpu);
+ watchdog_timer.expires += WATCHDOG_INTERVAL;
+ add_timer_local(&watchdog_timer);
}
-out:
- spin_unlock(&watchdog_lock);
}
static inline void clocksource_start_watchdog(void)
{
- if (watchdog_running || !watchdog || list_empty(&watchdog_list))
+ if (watchdog_running || list_empty(&watchdog_list))
return;
- timer_setup(&watchdog_timer, clocksource_watchdog, 0);
+ timer_setup(&watchdog_timer, clocksource_watchdog, TIMER_PINNED);
watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
- add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+
+ add_timer_on(&watchdog_timer, get_boot_cpu_id());
watchdog_running = 1;
}
static inline void clocksource_stop_watchdog(void)
{
- if (!watchdog_running || (watchdog && !list_empty(&watchdog_list)))
+ if (!watchdog_running || !list_empty(&watchdog_list))
return;
timer_delete(&watchdog_timer);
watchdog_running = 0;
@@ -651,6 +734,13 @@ static void clocksource_select_watchdog(bool fallback)
if (cs->flags & CLOCK_SOURCE_MUST_VERIFY)
continue;
+ /*
+ * If it's not continuous, don't put the fox in charge of
+ * the henhouse.
+ */
+ if (!(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS))
+ continue;
+
/* Skip current if we were requested for a fallback. */
if (fallback && cs == old_wd)
continue;
@@ -690,12 +780,6 @@ static int __clocksource_watchdog_kthread(void)
unsigned long flags;
int select = 0;
- /* Do any required per-CPU skew verification. */
- if (curr_clocksource &&
- curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
- curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
- clocksource_verify_percpu(curr_clocksource);
-
spin_lock_irqsave(&watchdog_lock, flags);
list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -1016,6 +1100,8 @@ static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
continue;
if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES))
continue;
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
+ continue;
return cs;
}
return NULL;
@@ -1040,6 +1126,8 @@ static void __clocksource_select(bool skipcur)
continue;
if (strcmp(cs->name, override_name) != 0)
continue;
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
+ continue;
/*
* Check to make sure we don't switch to a non-highres
* capable clocksource if the tick code is in oneshot
@@ -1169,31 +1257,10 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
NSEC_PER_SEC / scale, sec * scale);
- }
- /*
- * If the uncertainty margin is not specified, calculate it. If
- * both scale and freq are non-zero, calculate the clock period, but
- * bound below at 2*WATCHDOG_MAX_SKEW, that is, 500ppm by default.
- * However, if either of scale or freq is zero, be very conservative
- * and take the tens-of-milliseconds WATCHDOG_THRESHOLD value
- * for the uncertainty margin. Allow stupidly small uncertainty
- * margins to be specified by the caller for testing purposes,
- * but warn to discourage production use of this capability.
- *
- * Bottom line: The sum of the uncertainty margins of the
- * watchdog clocksource and the clocksource under test will be at
- * least 500ppm by default. For more information, please see the
- * comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
- */
- if (scale && freq && !cs->uncertainty_margin) {
- cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
- if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
- cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
- } else if (!cs->uncertainty_margin) {
- cs->uncertainty_margin = WATCHDOG_THRESHOLD;
+ /* Update cs::freq_khz */
+ cs->freq_khz = div_u64((u64)freq * scale, 1000);
}
- WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
/*
* Ensure clocksources that have large 'mult' values don't overflow
@@ -1241,6 +1308,10 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
cs->id = CSID_GENERIC;
+
+ if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+ cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT;
+
if (cs->vdso_clock_mode < 0 ||
cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 860af7a58428..5bd6efe598f0 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -49,6 +49,28 @@
#include "tick-internal.h"
+/*
+ * Constants to set the queued state of the timer (INACTIVE, ENQUEUED)
+ *
+ * The callback state is kept separate in the CPU base because having it in
+ * the timer would require touching the timer after the callback, which
+ * makes it impossible to free the timer from the callback function.
+ *
+ * Therefore we track the callback state in:
+ *
+ * timer->base->cpu_base->running == timer
+ *
+ * On SMP it is possible to have a "callback function running and enqueued"
+ * status. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE false
+#define HRTIMER_STATE_ENQUEUED true
+
/*
* The resolution of the clocks. The resolution value is returned in
* the clock_getres() system call to give application programmers an
@@ -77,43 +99,22 @@ static ktime_t __hrtimer_cb_get_time(clockid_t clock_id);
* to reach a base using a clockid, hrtimer_clockid_to_base()
* is used to convert from clockid to the proper hrtimer_base_type.
*/
+
+#define BASE_INIT(idx, cid) \
+ [idx] = { .index = idx, .clockid = cid }
+
DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
{
.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
- .clock_base =
- {
- {
- .index = HRTIMER_BASE_MONOTONIC,
- .clockid = CLOCK_MONOTONIC,
- },
- {
- .index = HRTIMER_BASE_REALTIME,
- .clockid = CLOCK_REALTIME,
- },
- {
- .index = HRTIMER_BASE_BOOTTIME,
- .clockid = CLOCK_BOOTTIME,
- },
- {
- .index = HRTIMER_BASE_TAI,
- .clockid = CLOCK_TAI,
- },
- {
- .index = HRTIMER_BASE_MONOTONIC_SOFT,
- .clockid = CLOCK_MONOTONIC,
- },
- {
- .index = HRTIMER_BASE_REALTIME_SOFT,
- .clockid = CLOCK_REALTIME,
- },
- {
- .index = HRTIMER_BASE_BOOTTIME_SOFT,
- .clockid = CLOCK_BOOTTIME,
- },
- {
- .index = HRTIMER_BASE_TAI_SOFT,
- .clockid = CLOCK_TAI,
- },
+ .clock_base = {
+ BASE_INIT(HRTIMER_BASE_MONOTONIC, CLOCK_MONOTONIC),
+ BASE_INIT(HRTIMER_BASE_REALTIME, CLOCK_REALTIME),
+ BASE_INIT(HRTIMER_BASE_BOOTTIME, CLOCK_BOOTTIME),
+ BASE_INIT(HRTIMER_BASE_TAI, CLOCK_TAI),
+ BASE_INIT(HRTIMER_BASE_MONOTONIC_SOFT, CLOCK_MONOTONIC),
+ BASE_INIT(HRTIMER_BASE_REALTIME_SOFT, CLOCK_REALTIME),
+ BASE_INIT(HRTIMER_BASE_BOOTTIME_SOFT, CLOCK_BOOTTIME),
+ BASE_INIT(HRTIMER_BASE_TAI_SOFT, CLOCK_TAI),
},
.csd = CSD_INIT(retrigger_next_event, NULL)
};
@@ -126,23 +127,43 @@ static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
return likely(base->online);
}
+#ifdef CONFIG_HIGH_RES_TIMERS
+DEFINE_STATIC_KEY_FALSE(hrtimer_highres_enabled_key);
+
+static void hrtimer_hres_workfn(struct work_struct *work)
+{
+ static_branch_enable(&hrtimer_highres_enabled_key);
+}
+
+static DECLARE_WORK(hrtimer_hres_work, hrtimer_hres_workfn);
+
+static inline void hrtimer_schedule_hres_work(void)
+{
+ if (!hrtimer_highres_enabled())
+ schedule_work(&hrtimer_hres_work);
+}
+#else
+static inline void hrtimer_schedule_hres_work(void) { }
+#endif
+
/*
* Functions and macros which are different for UP/SMP systems are kept in a
* single place
*/
#ifdef CONFIG_SMP
-
/*
* We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base()
* such that hrtimer_callback_running() can unconditionally dereference
* timer->base->cpu_base
*/
static struct hrtimer_cpu_base migration_cpu_base = {
- .clock_base = { {
- .cpu_base = &migration_cpu_base,
- .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
- &migration_cpu_base.lock),
- }, },
+ .clock_base = {
+ [0] = {
+ .cpu_base = &migration_cpu_base,
+ .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
+ &migration_cpu_base.lock),
+ },
+ },
};
#define migration_base migration_cpu_base.clock_base[0]
@@ -159,15 +180,13 @@ static struct hrtimer_cpu_base migration_cpu_base = {
* possible to set timer->base = &migration_base and drop the lock: the timer
* remains locked.
*/
-static
-struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
- unsigned long *flags)
+static struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+ unsigned long *flags)
__acquires(&timer->base->lock)
{
- struct hrtimer_clock_base *base;
-
for (;;) {
- base = READ_ONCE(timer->base);
+ struct hrtimer_clock_base *base = READ_ONCE(timer->base);
+
if (likely(base != &migration_base)) {
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
if (likely(base == timer->base))
@@ -220,7 +239,7 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
return expires >= new_base->cpu_base->expires_next;
}
-static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
{
if (!hrtimer_base_is_online(base)) {
int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
@@ -248,8 +267,7 @@ static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *
* the timer callback is currently running.
*/
static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
- int pinned)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, bool pinned)
{
struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
struct hrtimer_clock_base *new_base;
@@ -262,13 +280,12 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
if (base != new_base) {
/*
- * We are trying to move timer to new_base.
- * However we can't change timer's base while it is running,
- * so we keep it on the same CPU. No hassle vs. reprogramming
- * the event source in the high resolution case. The softirq
- * code will take care of this when the timer function has
- * completed. There is no conflict as we hold the lock until
- * the timer is enqueued.
+ * We are trying to move timer to new_base. However we can't
+ * change timer's base while it is running, so we keep it on
+ * the same CPU. No hassle vs. reprogramming the event source
+ * in the high resolution case. The remote CPU will take care
+ * of this when the timer function has completed. There is no
+ * conflict as we hold the lock until the timer is enqueued.
*/
if (unlikely(hrtimer_callback_running(timer)))
return base;
@@ -278,8 +295,7 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);
- if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
- this_cpu_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
@@ -298,14 +314,13 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
#else /* CONFIG_SMP */
-static inline struct hrtimer_clock_base *
-lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+static inline struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+ unsigned long *flags)
__acquires(&timer->base->cpu_base->lock)
{
struct hrtimer_clock_base *base = timer->base;
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
-
return base;
}
@@ -340,7 +355,7 @@ s64 __ktime_divns(const ktime_t kt, s64 div)
return dclc < 0 ? -tmp : tmp;
}
EXPORT_SYMBOL_GPL(__ktime_divns);
-#endif /* BITS_PER_LONG >= 64 */
+#endif /* BITS_PER_LONG < 64 */
/*
* Add two ktime values and do a safety check for overflow:
@@ -422,12 +437,37 @@ static bool hrtimer_fixup_free(void *addr, enum debug_obj_state state)
}
}
+/* Stub timer callback for improperly used timers. */
+static enum hrtimer_restart stub_timer(struct hrtimer *unused)
+{
+ WARN_ON_ONCE(1);
+ return HRTIMER_NORESTART;
+}
+
+/*
+ * hrtimer_fixup_assert_init is called when:
+ * - an untracked/uninit-ed object is found
+ */
+static bool hrtimer_fixup_assert_init(void *addr, enum debug_obj_state state)
+{
+ struct hrtimer *timer = addr;
+
+ switch (state) {
+ case ODEBUG_STATE_NOTAVAILABLE:
+ hrtimer_setup(timer, stub_timer, CLOCK_MONOTONIC, 0);
+ return true;
+ default:
+ return false;
+ }
+}
+
static const struct debug_obj_descr hrtimer_debug_descr = {
- .name = "hrtimer",
- .debug_hint = hrtimer_debug_hint,
- .fixup_init = hrtimer_fixup_init,
- .fixup_activate = hrtimer_fixup_activate,
- .fixup_free = hrtimer_fixup_free,
+ .name = "hrtimer",
+ .debug_hint = hrtimer_debug_hint,
+ .fixup_init = hrtimer_fixup_init,
+ .fixup_activate = hrtimer_fixup_activate,
+ .fixup_free = hrtimer_fixup_free,
+ .fixup_assert_init = hrtimer_fixup_assert_init,
};
static inline void debug_hrtimer_init(struct hrtimer *timer)
@@ -440,8 +480,7 @@ static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer)
debug_object_init_on_stack(timer, &hrtimer_debug_descr);
}
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
- enum hrtimer_mode mode)
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode)
{
debug_object_activate(timer, &hrtimer_debug_descr);
}
@@ -451,6 +490,11 @@ static inline void debug_hrtimer_deactivate(struct hrtimer *timer)
debug_object_deactivate(timer, &hrtimer_debug_descr);
}
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer)
+{
+ debug_object_assert_init(timer, &hrtimer_debug_descr);
+}
+
void destroy_hrtimer_on_stack(struct hrtimer *timer)
{
debug_object_free(timer, &hrtimer_debug_descr);
@@ -461,9 +505,9 @@ EXPORT_SYMBOL_GPL(destroy_hrtimer_on_stack);
static inline void debug_hrtimer_init(struct hrtimer *timer) { }
static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
- enum hrtimer_mode mode) { }
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) { }
static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
#endif
static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
@@ -479,80 +523,80 @@ static inline void debug_setup_on_stack(struct hrtimer *timer, clockid_t clockid
trace_hrtimer_setup(timer, clockid, mode);
}
-static inline void debug_activate(struct hrtimer *timer,
- enum hrtimer_mode mode)
+static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode, bool was_armed)
{
debug_hrtimer_activate(timer, mode);
- trace_hrtimer_start(timer, mode);
+ trace_hrtimer_start(timer, mode, was_armed);
}
-static inline void debug_deactivate(struct hrtimer *timer)
-{
- debug_hrtimer_deactivate(timer);
- trace_hrtimer_cancel(timer);
-}
+#define for_each_active_base(base, cpu_base, active) \
+ for (unsigned int idx = ffs(active); idx--; idx = ffs((active))) \
+ for (bool done = false; !done; active &= ~(1U << idx)) \
+ for (base = &cpu_base->clock_base[idx]; !done; done = true)
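[ The three nested for statements above let the macro declare a loop-scoped
  index, fetch the base, and clear the visited bit per iteration without a
  helper function. The bit walk itself boils down to repeated ffs(); a
  minimal user-space sketch of that walk, with a hypothetical helper name
  (not a kernel interface): ]

```c
#include <assert.h>
#include <strings.h>	/* ffs() */

/*
 * Sketch of the bitmask walk behind for_each_active_base(): return the
 * lowest set bit index of *active (0-based) and clear it, or -1 when the
 * mask is empty. Illustrative only, not the kernel API.
 */
static int next_active_index(unsigned int *active)
{
	int idx = ffs(*active);		/* 1-based position of lowest set bit, 0 if none */

	if (!idx)
		return -1;
	*active &= ~(1U << (idx - 1));	/* mark this base as visited */
	return idx - 1;
}
```

[ Each macro iteration corresponds to one call of this helper until it
  returns -1, visiting active clock bases from lowest index upwards. ]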
-static struct hrtimer_clock_base *
-__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
+#define hrtimer_from_timerqueue_node(_n) container_of_const(_n, struct hrtimer, node)
+
+#if defined(CONFIG_NO_HZ_COMMON)
+/*
+ * Same as hrtimer_bases_next_event() below, but skips the excluded timer and
+ * does not update cpu_base->next_timer/expires.
+ */
+static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_base,
+ const struct hrtimer *exclude,
+ unsigned int active, ktime_t expires_next)
{
- unsigned int idx;
+ struct hrtimer_clock_base *base;
+ ktime_t expires;
- if (!*active)
- return NULL;
+ lockdep_assert_held(&cpu_base->lock);
- idx = __ffs(*active);
- *active &= ~(1U << idx);
+ for_each_active_base(base, cpu_base, active) {
+ expires = ktime_sub(base->expires_next, base->offset);
+ if (expires >= expires_next)
+ continue;
+
+ /*
+		 * If the excluded timer is the first on this base, evaluate the
+ * next timer.
+ */
+ struct timerqueue_linked_node *node = timerqueue_linked_first(&base->active);
- return &cpu_base->clock_base[idx];
+ if (unlikely(&exclude->node == node)) {
+ node = timerqueue_linked_next(node);
+ if (!node)
+ continue;
+ expires = ktime_sub(node->expires, base->offset);
+ if (expires >= expires_next)
+ continue;
+ }
+ expires_next = expires;
+ }
+ /* If base->offset changed, the result might be negative */
+ return max(expires_next, 0);
}
+#endif
-#define for_each_active_base(base, cpu_base, active) \
- while ((base = __next_base((cpu_base), &(active))))
+static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
-static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
- const struct hrtimer *exclude,
- unsigned int active,
- ktime_t expires_next)
+ return hrtimer_from_timerqueue_node(next);
+}
+
+/* Find the base with the earliest expiry */
+static void hrtimer_bases_first(struct hrtimer_cpu_base *cpu_base, unsigned int active,
+ ktime_t *expires_next, struct hrtimer **next_timer)
{
struct hrtimer_clock_base *base;
ktime_t expires;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *next;
- struct hrtimer *timer;
-
- next = timerqueue_getnext(&base->active);
- timer = container_of(next, struct hrtimer, node);
- if (timer == exclude) {
- /* Get to the next timer in the queue. */
- next = timerqueue_iterate_next(next);
- if (!next)
- continue;
-
- timer = container_of(next, struct hrtimer, node);
- }
- expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
- if (expires < expires_next) {
- expires_next = expires;
-
- /* Skip cpu_base update if a timer is being excluded. */
- if (exclude)
- continue;
-
- if (timer->is_soft)
- cpu_base->softirq_next_timer = timer;
- else
- cpu_base->next_timer = timer;
+ expires = ktime_sub(base->expires_next, base->offset);
+ if (expires < *expires_next) {
+ *expires_next = expires;
+ *next_timer = clock_base_next_timer(base);
}
}
- /*
- * clock_was_set() might have changed base->offset of any of
- * the clock bases so the result might be negative. Fix it up
- * to prevent a false positive in clockevents_program_event().
- */
- if (expires_next < 0)
- expires_next = 0;
- return expires_next;
}
/*
@@ -575,30 +619,28 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
* - HRTIMER_ACTIVE_SOFT, or
* - HRTIMER_ACTIVE_HARD.
*/
-static ktime_t
-__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
+static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
{
- unsigned int active;
struct hrtimer *next_timer = NULL;
ktime_t expires_next = KTIME_MAX;
+ unsigned int active;
+
+ lockdep_assert_held(&cpu_base->lock);
if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
- cpu_base->softirq_next_timer = NULL;
- expires_next = __hrtimer_next_event_base(cpu_base, NULL,
- active, KTIME_MAX);
-
- next_timer = cpu_base->softirq_next_timer;
+ if (active)
+ hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
+ cpu_base->softirq_next_timer = next_timer;
}
if (active_mask & HRTIMER_ACTIVE_HARD) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+ if (active)
+ hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
cpu_base->next_timer = next_timer;
- expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
- expires_next);
}
-
- return expires_next;
+ return max(expires_next, 0);
}
static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
@@ -638,8 +680,8 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;
- ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
- offs_real, offs_boot, offs_tai);
+ ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq, offs_real,
+ offs_boot, offs_tai);
base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
@@ -649,7 +691,9 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
}
/*
- * Is the high resolution mode active ?
+ * Is the high resolution mode active in the CPU base? This cannot use the
+ * static key, as the CPUs are switched to high resolution mode
+ * asynchronously.
*/
static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
{
@@ -657,8 +701,13 @@ static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
cpu_base->hres_active : 0;
}
-static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
- struct hrtimer *next_timer,
+static inline void hrtimer_rearm_event(ktime_t expires_next, bool deferred)
+{
+ trace_hrtimer_rearm(expires_next, deferred);
+ tick_program_event(expires_next, 1);
+}
+
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
ktime_t expires_next)
{
cpu_base->expires_next = expires_next;
@@ -683,20 +732,13 @@ static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
return;
- tick_program_event(expires_next, 1);
+ hrtimer_rearm_event(expires_next, false);
}
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
-static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+/* Reprogram the event source with an evaluation of all clock bases */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
{
- ktime_t expires_next;
-
- expires_next = hrtimer_update_next_event(cpu_base);
+ ktime_t expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -707,57 +749,49 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
/* High resolution timer related functions */
#ifdef CONFIG_HIGH_RES_TIMERS
-/*
- * High resolution timer enabled ?
- */
+/* High resolution timer enabled? */
static bool hrtimer_hres_enabled __read_mostly = true;
unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
EXPORT_SYMBOL_GPL(hrtimer_resolution);
-/*
- * Enable / Disable high resolution mode
- */
+/* Enable / Disable high resolution mode */
static int __init setup_hrtimer_hres(char *str)
{
return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
}
-
__setup("highres=", setup_hrtimer_hres);
-/*
- * hrtimer_high_res_enabled - query, if the highres mode is enabled
- */
-static inline int hrtimer_is_hres_enabled(void)
+/* hrtimer_is_hres_enabled - query whether the highres mode is enabled */
+static inline bool hrtimer_is_hres_enabled(void)
{
return hrtimer_hres_enabled;
}
-/*
- * Switch to high resolution mode
- */
+/* Switch to high resolution mode */
static void hrtimer_switch_to_hres(void)
{
struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
if (tick_init_highres()) {
- pr_warn("Could not switch to high resolution mode on CPU %u\n",
- base->cpu);
+ pr_warn("Could not switch to high resolution mode on CPU %u\n", base->cpu);
return;
}
- base->hres_active = 1;
+ base->hres_active = true;
hrtimer_resolution = HIGH_RES_NSEC;
tick_setup_sched_timer(true);
/* "Retrigger" the interrupt to get things going */
retrigger_next_event(NULL);
+ hrtimer_schedule_hres_work();
}
#else
-static inline int hrtimer_is_hres_enabled(void) { return 0; }
+static inline bool hrtimer_is_hres_enabled(void) { return false; }
static inline void hrtimer_switch_to_hres(void) { }
#endif /* CONFIG_HIGH_RES_TIMERS */
+
/*
* Retrigger next event is called after clock was set with interrupts
* disabled through an SMP function call or directly from low level
@@ -792,13 +826,12 @@ static void retrigger_next_event(void *arg)
* In periodic low resolution mode, the next softirq expiration
* must also be updated.
*/
- raw_spin_lock(&base->lock);
+ guard(raw_spinlock)(&base->lock);
hrtimer_update_base(base);
if (hrtimer_hres_active(base))
- hrtimer_force_reprogram(base, 0);
+ hrtimer_force_reprogram(base, /* skip_equal */ false);
else
hrtimer_update_next_event(base);
- raw_spin_unlock(&base->lock);
}
/*
@@ -812,10 +845,11 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
struct hrtimer_clock_base *base = timer->base;
- ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+ ktime_t expires = hrtimer_get_expires(timer);
- WARN_ON_ONCE(hrtimer_get_expires(timer) < 0);
+ WARN_ON_ONCE(expires < 0);
+ expires = ktime_sub(expires, base->offset);
/*
* CLOCK_REALTIME timer might be requested with an absolute
* expiry time which is less than base->offset. Set it to 0.
@@ -842,8 +876,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
timer_cpu_base->softirq_next_timer = timer;
timer_cpu_base->softirq_expires_next = expires;
- if (!ktime_before(expires, timer_cpu_base->expires_next) ||
- !reprogram)
+ if (!ktime_before(expires, timer_cpu_base->expires_next) || !reprogram)
return;
}
@@ -857,11 +890,8 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
if (expires >= cpu_base->expires_next)
return;
- /*
- * If the hrtimer interrupt is running, then it will reevaluate the
- * clock bases and reprogram the clock event device.
- */
- if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending, skip reprogramming the device */
+ if (cpu_base->deferred_rearm)
return;
cpu_base->next_timer = timer;
@@ -869,8 +899,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
__hrtimer_reprogram(cpu_base, timer, expires);
}
-static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
- unsigned int active)
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int active)
{
struct hrtimer_clock_base *base;
unsigned int seq;
@@ -896,13 +925,11 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
if (seq == cpu_base->clock_was_set_seq)
return false;
- /*
- * If the remote CPU is currently handling an hrtimer interrupt, it
- * will reevaluate the first expiring timer of all clock bases
- * before reprogramming. Nothing to do here.
- */
- if (cpu_base->in_hrtirq)
+	/* If a deferred rearm is pending, the remote CPU will take care of it */
+ if (cpu_base->deferred_rearm) {
+ cpu_base->deferred_needs_update = true;
return false;
+ }
/*
* Walk the affected clock bases and check whether the first expiring
@@ -913,9 +940,9 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
active &= cpu_base->active_bases;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *next;
+ struct timerqueue_linked_node *next;
- next = timerqueue_getnext(&base->active);
+ next = timerqueue_linked_first(&base->active);
expires = ktime_sub(next->expires, base->offset);
if (expires < cpu_base->expires_next)
return true;
@@ -947,11 +974,9 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
*/
void clock_was_set(unsigned int bases)
{
- struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
cpumask_var_t mask;
- int cpu;
- if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
+ if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
goto out_timerfd;
if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -960,23 +985,19 @@ void clock_was_set(unsigned int bases)
}
/* Avoid interrupting CPUs if possible */
- cpus_read_lock();
- for_each_online_cpu(cpu) {
- unsigned long flags;
-
- cpu_base = &per_cpu(hrtimer_bases, cpu);
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ scoped_guard(cpus_read_lock) {
+ int cpu;
- if (update_needs_ipi(cpu_base, bases))
- cpumask_set_cpu(cpu, mask);
+ for_each_online_cpu(cpu) {
+ struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
+ if (update_needs_ipi(cpu_base, bases))
+ cpumask_set_cpu(cpu, mask);
+ }
+ scoped_guard(preempt)
+ smp_call_function_many(mask, retrigger_next_event, NULL, 1);
}
-
- preempt_disable();
- smp_call_function_many(mask, retrigger_next_event, NULL, 1);
- preempt_enable();
- cpus_read_unlock();
free_cpumask_var(mask);
out_timerfd:
@@ -1011,11 +1032,8 @@ void hrtimers_resume_local(void)
retrigger_next_event(NULL);
}
-/*
- * Counterpart to lock_hrtimer_base above:
- */
-static inline
-void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+/* Counterpart to lock_hrtimer_base above */
+static inline void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
__releases(&timer->base->cpu_base->lock)
{
raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
@@ -1032,7 +1050,7 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
* .. note::
* This only updates the timer expiry value and does not requeue the timer.
*
- * There is also a variant of the function hrtimer_forward_now().
+ * There is also a variant of this function: hrtimer_forward_now().
*
* Context: Can be safely called from the callback function of @timer. If called
* from other contexts @timer must neither be enqueued nor running the
@@ -1042,15 +1060,15 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
*/
u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
- u64 orun = 1;
ktime_t delta;
+ u64 orun = 1;
delta = ktime_sub(now, hrtimer_get_expires(timer));
if (delta < 0)
return 0;
- if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+ if (WARN_ON(timer->is_queued))
return 0;
if (interval < hrtimer_resolution)
@@ -1079,73 +1097,98 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
* enqueue_hrtimer - internal function to (re)start a timer
*
* The timer is inserted in expiry order. Insertion into the
- * red black tree is O(log(n)). Must hold the base lock.
+ * red black tree is O(log(n)).
*
* Returns true when the new timer is the leftmost timer in the tree.
*/
static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
- enum hrtimer_mode mode)
+ enum hrtimer_mode mode, bool was_armed)
{
- debug_activate(timer, mode);
+ lockdep_assert_held(&base->cpu_base->lock);
+
+ debug_activate(timer, mode, was_armed);
WARN_ON_ONCE(!base->cpu_base->online);
base->cpu_base->active_bases |= 1 << base->index;
/* Pairs with the lockless read in hrtimer_is_queued() */
- WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);
+	WRITE_ONCE(timer->is_queued, true);
+
+ if (!timerqueue_linked_add(&base->active, &timer->node))
+ return false;
+
+ base->expires_next = hrtimer_get_expires(timer);
+ return true;
+}
- return timerqueue_add(&base->active, &timer->node);
+static inline void base_update_next_timer(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
+
+ base->expires_next = next ? next->expires : KTIME_MAX;
}
/*
* __remove_hrtimer - internal function to remove a timer
*
- * Caller must hold the base lock.
- *
* High resolution timer mode reprograms the clock event device when the
* timer is the one which expires next. The caller can disable this by setting
* reprogram to zero. This is useful, when the context does a reprogramming
* anyway (e.g. timer interrupt)
*/
-static void __remove_hrtimer(struct hrtimer *timer,
- struct hrtimer_clock_base *base,
- u8 newstate, int reprogram)
+static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool newstate, bool reprogram)
{
struct hrtimer_cpu_base *cpu_base = base->cpu_base;
- u8 state = timer->state;
+ bool was_first;
- /* Pairs with the lockless read in hrtimer_is_queued() */
- WRITE_ONCE(timer->state, newstate);
- if (!(state & HRTIMER_STATE_ENQUEUED))
+ lockdep_assert_held(&cpu_base->lock);
+
+ if (!timer->is_queued)
return;
- if (!timerqueue_del(&base->active, &timer->node))
+ /* Pairs with the lockless read in hrtimer_is_queued() */
+ WRITE_ONCE(timer->is_queued, newstate);
+
+ was_first = !timerqueue_linked_prev(&timer->node);
+
+ if (!timerqueue_linked_del(&base->active, &timer->node))
cpu_base->active_bases &= ~(1 << base->index);
+ /* Nothing to update if this was not the first timer in the base */
+ if (!was_first)
+ return;
+
+ base_update_next_timer(base);
+
/*
- * Note: If reprogram is false we do not update
- * cpu_base->next_timer. This happens when we remove the first
- * timer on a remote cpu. No harm as we never dereference
- * cpu_base->next_timer. So the worst thing what can happen is
- * an superfluous call to hrtimer_force_reprogram() on the
- * remote cpu later on if the same timer gets enqueued again.
+ * If reprogram is false don't update cpu_base->next_timer and do not
+ * touch the clock event device.
+ *
+ * This happens when removing the first timer on a remote CPU, which
+ * will be handled by the remote CPU's interrupt. It also happens when
+ * a local timer is removed to be immediately restarted. That's handled
+ * at the call site.
*/
- if (reprogram && timer == cpu_base->next_timer)
- hrtimer_force_reprogram(cpu_base, 1);
+ if (!reprogram || timer != cpu_base->next_timer || timer->is_lazy)
+ return;
+
+ if (cpu_base->deferred_rearm)
+ cpu_base->deferred_needs_update = true;
+ else
+ hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
}
-/*
- * remove hrtimer, called with base lock held
- */
-static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
- bool restart, bool keep_local)
+static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool newstate)
{
- u8 state = timer->state;
+ lockdep_assert_held(&base->cpu_base->lock);
- if (state & HRTIMER_STATE_ENQUEUED) {
+ if (timer->is_queued) {
bool reprogram;
+ debug_hrtimer_deactivate(timer);
+
/*
* Remove the timer and force reprogramming when high
* resolution mode is active and the timer is on the current
@@ -1154,24 +1197,81 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
* reprogramming happens in the interrupt handler. This is a
* rare case and less expensive than a smp call.
*/
- debug_deactivate(timer);
reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
- /*
- * If the timer is not restarted then reprogramming is
- * required if the timer is local. If it is local and about
- * to be restarted, avoid programming it twice (on removal
- * and a moment later when it's requeued).
- */
- if (!restart)
- state = HRTIMER_STATE_INACTIVE;
- else
- reprogram &= !keep_local;
+ __remove_hrtimer(timer, base, newstate, reprogram);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Update in place has to retrieve the expiry times of the neighbour nodes
+ * if they exist. That is cache line neutral because the dequeue/enqueue
+ * operation is going to need the same cache lines. But there is a big win
+ * when the dequeue/enqueue can be avoided because the RB tree does not
+ * have to be rebalanced twice.
+ */
+static inline bool
+hrtimer_can_update_in_place(struct hrtimer *timer, struct hrtimer_clock_base *base, ktime_t expires)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_next(&timer->node);
+ struct timerqueue_linked_node *prev = timerqueue_linked_prev(&timer->node);
+
+	/* If the new expiry moves past the next timer, a requeue is required */
+ if (next && expires > next->expires)
+ return false;
+
+ /* If this is the first timer, update in place */
+ if (!prev)
+ return true;
+
+ /* Update in place when it does not go ahead of the previous one */
+ return expires >= prev->expires;
+}
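[ Stripped of the timerqueue plumbing, the neighbour check is pure interval
  logic: a node in an expiry-sorted queue may take a new expiry in place
  exactly when the new value still sorts between its neighbours. A
  self-contained sketch of that predicate, with illustrative names and
  has_prev/has_next flags standing in for the NULL neighbour pointers: ]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the in-place update predicate: the node keeps its position
 * iff the new expiry does not move past the next node and does not move
 * ahead of the previous one. Hypothetical helper, not the kernel API.
 */
static bool can_update_in_place(bool has_prev, long long prev_expires,
				bool has_next, long long next_expires,
				long long new_expires)
{
	/* Moving past the next node requires a requeue */
	if (has_next && new_expires > next_expires)
		return false;
	/* The first node can always move earlier */
	if (!has_prev)
		return true;
	/* Must not move ahead of the previous node */
	return new_expires >= prev_expires;
}
```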
+
+static inline bool
+remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
+{
+ bool was_first = false;
+
+ /* Remove it from the timer queue if active */
+ if (timer->is_queued) {
+ was_first = !timerqueue_linked_prev(&timer->node);
+
+ /* Try to update in place to avoid the de/enqueue dance */
+ if (hrtimer_can_update_in_place(timer, base, expires)) {
+ hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+ trace_hrtimer_start(timer, mode, true);
+ if (was_first)
+ base->expires_next = expires;
+ return was_first;
+ }
- __remove_hrtimer(timer, base, state, reprogram);
- return 1;
+ debug_hrtimer_deactivate(timer);
+ timerqueue_linked_del(&base->active, &timer->node);
}
- return 0;
+
+ /* Set the new expiry time */
+ hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+
+ debug_activate(timer, mode, timer->is_queued);
+ base->cpu_base->active_bases |= 1 << base->index;
+
+ /* Pairs with the lockless read in hrtimer_is_queued() */
+	WRITE_ONCE(timer->is_queued, true);
+
+ /* If it's the first expiring timer now or again, update base */
+ if (timerqueue_linked_add(&base->active, &timer->node)) {
+ base->expires_next = expires;
+ return true;
+ }
+
+ if (was_first)
+ base_update_next_timer(base);
+
+ return false;
}
static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
@@ -1190,55 +1290,93 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
return tim;
}
-static void
-hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
+static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
{
- ktime_t expires;
-
- /*
- * Find the next SOFT expiration.
- */
- expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ ktime_t expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
/*
- * reprogramming needs to be triggered, even if the next soft
- * hrtimer expires at the same time than the next hard
+ * Reprogramming needs to be triggered, even if the next soft
+ * hrtimer expires at the same time as the next hard
* hrtimer. cpu_base->softirq_expires_next needs to be updated!
*/
if (expires == KTIME_MAX)
return;
/*
- * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
- * cpu_base->*expires_next is only set by hrtimer_reprogram()
+ * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
+ * cpu_base->expires_next is only set by hrtimer_reprogram()
*/
hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
}
-static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- u64 delta_ns, const enum hrtimer_mode mode,
- struct hrtimer_clock_base *base)
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+ if (static_branch_likely(&timers_migration_enabled)) {
+ /*
+ * If it is local and the first expiring timer keep it on the local
+ * CPU to optimize reprogramming of the clockevent device. Also
+ * avoid switch_hrtimer_base() overhead when local and pinned.
+ */
+ if (!is_local)
+ return false;
+ if (is_first || is_pinned)
+ return true;
+
+ /* Honour the NOHZ full restrictions */
+ if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+ return false;
+
+ /*
+ * If the tick is not stopped or need_resched() is set, then
+ * there is no point in moving the timer somewhere else.
+ */
+ return !tick_nohz_tick_stopped() || need_resched();
+ }
+ return is_local;
+}
+#else
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+ return is_local;
+}
+#endif
+
+static inline bool hrtimer_keep_base(struct hrtimer *timer, bool is_local, bool is_first,
+ bool is_pinned)
+{
+ /* If the timer is running the callback, it has to stay on its CPU base. */
+ if (unlikely(timer->base->running == timer))
+ return true;
+
+ return hrtimer_prefer_local(is_local, is_first, is_pinned);
+}
+
+static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+ const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
{
struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
- struct hrtimer_clock_base *new_base;
- bool force_local, first;
+ bool is_pinned, first, was_first, keep_base = false;
+ struct hrtimer_cpu_base *cpu_base = base->cpu_base;
- /*
- * If the timer is on the local cpu base and is the first expiring
- * timer then this might end up reprogramming the hardware twice
- * (on removal and on enqueue). To avoid that by prevent the
- * reprogram on removal, keep the timer local to the current CPU
- * and enforce reprogramming after it is queued no matter whether
- * it is the new first expiring timer again or not.
- */
- force_local = base->cpu_base == this_cpu_base;
- force_local &= base->cpu_base->next_timer == timer;
+ was_first = cpu_base->next_timer == timer;
+ is_pinned = !!(mode & HRTIMER_MODE_PINNED);
/*
- * Don't force local queuing if this enqueue happens on a unplugged
- * CPU after hrtimer_cpu_dying() has been invoked.
+ * Don't keep it local if this enqueue happens on an unplugged CPU
+ * after hrtimer_cpu_dying() has been invoked.
*/
- force_local &= this_cpu_base->online;
+ if (likely(this_cpu_base->online)) {
+ bool is_local = cpu_base == this_cpu_base;
+
+ keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
+ }
+
+ /* Calculate absolute expiry time for relative timers */
+ if (mode & HRTIMER_MODE_REL)
+ tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+ /* Compensate for low resolution granularity */
+ tim = hrtimer_update_lowres(timer, tim, mode);
/*
* Remove an active timer from the queue. In case it is not queued
@@ -1250,32 +1388,41 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* reprogramming later if it was the first expiring timer. This
* avoids programming the underlying clock event twice (once at
* removal and once after enqueue).
+ *
+ * @keep_base is also true if the timer callback is running on a
+ * remote CPU and for local pinned timers.
*/
- remove_hrtimer(timer, base, true, force_local);
+ if (likely(keep_base)) {
+ first = remove_and_enqueue_same_base(timer, base, mode, tim, delta_ns);
+ } else {
+ /* Keep the ENQUEUED state in case it is queued */
+ bool was_armed = remove_hrtimer(timer, base, HRTIMER_STATE_ENQUEUED);
- if (mode & HRTIMER_MODE_REL)
- tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+ hrtimer_set_expires_range_ns(timer, tim, delta_ns);
- tim = hrtimer_update_lowres(timer, tim, mode);
+ /* Switch the timer base, if necessary: */
+ base = switch_hrtimer_base(timer, base, is_pinned);
+ cpu_base = base->cpu_base;
- hrtimer_set_expires_range_ns(timer, tim, delta_ns);
+ first = enqueue_hrtimer(timer, base, mode, was_armed);
+ }
- /* Switch the timer base, if necessary: */
- if (!force_local) {
- new_base = switch_hrtimer_base(timer, base,
- mode & HRTIMER_MODE_PINNED);
- } else {
- new_base = base;
+ /* If a deferred rearm is pending, skip reprogramming the device */
+ if (cpu_base->deferred_rearm) {
+ cpu_base->deferred_needs_update = true;
+ return false;
}
- first = enqueue_hrtimer(timer, new_base, mode);
- if (!force_local) {
+ if (!was_first || cpu_base != this_cpu_base) {
/*
- * If the current CPU base is online, then the timer is
- * never queued on a remote CPU if it would be the first
- * expiring timer there.
+ * If the current CPU base is online, then the timer is never
+ * queued on a remote CPU if it would be the first expiring
+ * timer there unless the timer callback is currently executed
+ * on the remote CPU. In the latter case the remote CPU will
+ * re-evaluate the first expiring timer after completing the
+ * callbacks.
*/
- if (hrtimer_base_is_online(this_cpu_base))
+ if (likely(hrtimer_base_is_online(this_cpu_base)))
return first;
/*
@@ -1283,21 +1430,33 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* already offline. If the timer is the first to expire,
* kick the remote CPU to reprogram the clock event.
*/
- if (first) {
- struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
+ if (first)
+ smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
+ return false;
+ }
- smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
- }
- return 0;
+ /*
+ * Special case for the HRTICK timer. It is frequently rearmed and most
+ * of the time moves the expiry into the future. That's expensive in
+ * virtual machines and it's better to take the pointless already armed
+ * interrupt than reprogramming the hardware on every context switch.
+ *
+ * If the new expiry is before the armed time, then reprogramming is
+ * required.
+ */
+ if (timer->is_lazy) {
+ if (cpu_base->expires_next <= hrtimer_get_expires(timer))
+ return false;
}
/*
- * Timer was forced to stay on the current CPU to avoid
- * reprogramming on removal and enqueue. Force reprogram the
- * hardware by evaluating the new first expiring timer.
+ * Timer was the first expiring timer and forced to stay on the
+ * current CPU to avoid reprogramming on removal and enqueue. Force
+ * reprogram the hardware by evaluating the new first expiring
+ * timer.
*/
- hrtimer_force_reprogram(new_base->cpu_base, 1);
- return 0;
+ hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
+ return false;
}
/**
@@ -1309,12 +1468,14 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
* softirq based mode is considered for debug purpose only!
*/
-void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- u64 delta_ns, const enum hrtimer_mode mode)
+void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+ const enum hrtimer_mode mode)
{
struct hrtimer_clock_base *base;
unsigned long flags;
+ debug_hrtimer_assert_init(timer);
+
/*
* Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
* match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
@@ -1362,8 +1523,11 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
base = lock_hrtimer_base(timer, &flags);
- if (!hrtimer_callback_running(timer))
- ret = remove_hrtimer(timer, base, false, false);
+ if (!hrtimer_callback_running(timer)) {
+ ret = remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
+ if (ret)
+ trace_hrtimer_cancel(timer);
+ }
unlock_hrtimer_base(timer, &flags);
@@ -1397,8 +1561,7 @@ static void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base)
* the timer callback to finish. Drop expiry_lock and reacquire it. That
* allows the waiter to acquire the lock and make progress.
*/
-static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
- unsigned long flags)
+static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base, unsigned long flags)
{
if (atomic_read(&cpu_base->timer_waiters)) {
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1463,14 +1626,10 @@ void hrtimer_cancel_wait_running(const struct hrtimer *timer)
spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
}
#else
-static inline void
-hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
- unsigned long flags) { }
+static inline void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base, unsigned long fl) { }
#endif
/**
@@ -1526,15 +1685,11 @@ u64 hrtimer_get_next_event(void)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
u64 expires = KTIME_MAX;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
if (!hrtimer_hres_active(cpu_base))
expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
return expires;
}
@@ -1549,26 +1704,20 @@ u64 hrtimer_next_event_without(const struct hrtimer *exclude)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
u64 expires = KTIME_MAX;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
-
- if (hrtimer_hres_active(cpu_base)) {
- unsigned int active;
+ unsigned int active;
- if (!cpu_base->softirq_activated) {
- active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
- expires = __hrtimer_next_event_base(cpu_base, exclude,
- active, KTIME_MAX);
- }
- active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
- expires = __hrtimer_next_event_base(cpu_base, exclude, active,
- expires);
- }
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
+ if (!hrtimer_hres_active(cpu_base))
+ return expires;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+ if (active && !cpu_base->softirq_activated)
+ expires = hrtimer_bases_next_event_without(cpu_base, exclude, active, KTIME_MAX);
- return expires;
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+ if (!active)
+ return expires;
+ return hrtimer_bases_next_event_without(cpu_base, exclude, active, expires);
}
#endif
@@ -1612,8 +1761,7 @@ ktime_t hrtimer_cb_get_time(const struct hrtimer *timer)
}
EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);
-static void __hrtimer_setup(struct hrtimer *timer,
- enum hrtimer_restart (*function)(struct hrtimer *),
+static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(struct hrtimer *),
clockid_t clock_id, enum hrtimer_mode mode)
{
bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
@@ -1645,13 +1793,14 @@ static void __hrtimer_setup(struct hrtimer *timer,
base += hrtimer_clockid_to_base(clock_id);
timer->is_soft = softtimer;
timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+ timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
timer->base = &cpu_base->clock_base[base];
- timerqueue_init(&timer->node);
+ timerqueue_linked_init(&timer->node);
- if (WARN_ON_ONCE(!function))
+ if (WARN_ON_ONCE(!fn))
ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
else
- ACCESS_PRIVATE(timer, function) = function;
+ ACCESS_PRIVATE(timer, function) = fn;
}
/**
@@ -1710,12 +1859,10 @@ bool hrtimer_active(const struct hrtimer *timer)
base = READ_ONCE(timer->base);
seq = raw_read_seqcount_begin(&base->seq);
- if (timer->state != HRTIMER_STATE_INACTIVE ||
- base->running == timer)
+ if (timer->is_queued || base->running == timer)
return true;
- } while (read_seqcount_retry(&base->seq, seq) ||
- base != READ_ONCE(timer->base));
+ } while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
return false;
}
@@ -1729,7 +1876,7 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
* - callback: the timer is being ran
* - post: the timer is inactive or (re)queued
*
- * On the read side we ensure we observe timer->state and cpu_base->running
+ * On the read side we ensure we observe timer->is_queued and cpu_base->running
* from the same section, if anything changed while we looked at it, we retry.
* This includes timer->base changing because sequence numbers alone are
* insufficient for that.
@@ -1738,11 +1885,9 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
* a false negative if the read side got smeared over multiple consecutive
* __run_hrtimer() invocations.
*/
-
-static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
- struct hrtimer_clock_base *base,
- struct hrtimer *timer, ktime_t *now,
- unsigned long flags) __must_hold(&cpu_base->lock)
+static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_clock_base *base,
+ struct hrtimer *timer, ktime_t now, unsigned long flags)
+ __must_hold(&cpu_base->lock)
{
enum hrtimer_restart (*fn)(struct hrtimer *);
bool expires_in_hardirq;
@@ -1754,15 +1899,15 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
base->running = timer;
/*
- * Separate the ->running assignment from the ->state assignment.
+ * Separate the ->running assignment from the ->is_queued assignment.
*
* As with a regular write barrier, this ensures the read side in
* hrtimer_active() cannot observe base->running == NULL &&
- * timer->state == INACTIVE.
+ * !timer->is_queued.
*/
raw_write_seqcount_barrier(&base->seq);
- __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
+ __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, false);
fn = ACCESS_PRIVATE(timer, function);
/*
@@ -1797,16 +1942,15 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
* hrtimer_start_range_ns() can have popped in and enqueued the timer
* for us already.
*/
- if (restart != HRTIMER_NORESTART &&
- !(timer->state & HRTIMER_STATE_ENQUEUED))
- enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
+ if (restart == HRTIMER_RESTART && !timer->is_queued)
+ enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
/*
- * Separate the ->running assignment from the ->state assignment.
+ * Separate the ->running assignment from the ->is_queued assignment.
*
* As with a regular write barrier, this ensures the read side in
* hrtimer_active() cannot observe base->running.timer == NULL &&
- * timer->state == INACTIVE.
+ * !timer->is_queued.
*/
raw_write_seqcount_barrier(&base->seq);
@@ -1814,23 +1958,24 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
base->running = NULL;
}
+static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
+
+ return next ? hrtimer_from_timerqueue_node(next) : NULL;
+}
+
static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
unsigned long flags, unsigned int active_mask)
{
- struct hrtimer_clock_base *base;
unsigned int active = cpu_base->active_bases & active_mask;
+ struct hrtimer_clock_base *base;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *node;
- ktime_t basenow;
-
- basenow = ktime_add(now, base->offset);
-
- while ((node = timerqueue_getnext(&base->active))) {
- struct hrtimer *timer;
-
- timer = container_of(node, struct hrtimer, node);
+ ktime_t basenow = ktime_add(now, base->offset);
+ struct hrtimer *timer;
+ while ((timer = clock_base_next_timer_safe(base))) {
/*
* The immediate goal for using the softexpires is
* minimizing wakeups, not running timers at the
@@ -1846,7 +1991,7 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
if (basenow < hrtimer_get_softexpires(timer))
break;
- __run_hrtimer(cpu_base, base, timer, &basenow, flags);
+ __run_hrtimer(cpu_base, base, timer, basenow, flags);
if (active_mask == HRTIMER_ACTIVE_SOFT)
hrtimer_sync_wait_running(cpu_base, flags);
}
@@ -1865,7 +2010,7 @@ static __latent_entropy void hrtimer_run_softirq(void)
now = hrtimer_update_base(cpu_base);
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
- cpu_base->softirq_activated = 0;
+ cpu_base->softirq_activated = false;
hrtimer_update_softirq_timer(cpu_base, true);
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1874,6 +2019,63 @@ static __latent_entropy void hrtimer_run_softirq(void)
#ifdef CONFIG_HIGH_RES_TIMERS
+/*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * deferred_rearm and hang_detected.
+ */
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next, bool deferred)
+{
+ cpu_base->expires_next = expires_next;
+ cpu_base->deferred_rearm = false;
+
+ if (unlikely(cpu_base->hang_detected)) {
+ /*
+ * Give the system a chance to do something else than looping
+ * on hrtimer interrupts.
+ */
+ expires_next = ktime_add_ns(ktime_get(),
+ min(100 * NSEC_PER_MSEC, cpu_base->max_hang_time));
+ }
+ hrtimer_rearm_event(expires_next, deferred);
+}
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+void __hrtimer_rearm_deferred(void)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+ ktime_t expires_next;
+
+ if (!cpu_base->deferred_rearm)
+ return;
+
+ guard(raw_spinlock)(&cpu_base->lock);
+ if (cpu_base->deferred_needs_update) {
+ hrtimer_update_base(cpu_base);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ } else {
+ /* No timer added/removed. Use the cached value */
+ expires_next = cpu_base->deferred_expires_next;
+ }
+ hrtimer_rearm(cpu_base, expires_next, true);
+}
+
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+ /* hrtimer_interrupt() just re-evaluated the first expiring timer */
+ cpu_base->deferred_needs_update = false;
+ /* Cache the expiry time */
+ cpu_base->deferred_expires_next = expires_next;
+ set_thread_flag(TIF_HRTIMER_REARM);
+}
+#else /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+ hrtimer_rearm(cpu_base, expires_next, false);
+}
+#endif /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
/*
* High resolution timer interrupt
* Called with interrupts disabled
@@ -1888,86 +2090,55 @@ void hrtimer_interrupt(struct clock_event_device *dev)
BUG_ON(!cpu_base->hres_active);
cpu_base->nr_events++;
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
raw_spin_lock_irqsave(&cpu_base->lock, flags);
entry_time = now = hrtimer_update_base(cpu_base);
retry:
- cpu_base->in_hrtirq = 1;
+ cpu_base->deferred_rearm = true;
/*
- * We set expires_next to KTIME_MAX here with cpu_base->lock
- * held to prevent that a timer is enqueued in our queue via
- * the migration code. This does not affect enqueueing of
- * timers which run their callback and need to be requeued on
- * this CPU.
+ * Set expires_next to KTIME_MAX, which prevents remote CPUs from
+ * queueing timers while __hrtimer_run_queues() is expiring the clock
+ * bases. Timers which are re/enqueued on the local CPU are not
+ * affected by this.
*/
cpu_base->expires_next = KTIME_MAX;
if (!ktime_before(now, cpu_base->softirq_expires_next)) {
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->softirq_activated = 1;
+ cpu_base->softirq_activated = true;
raise_timer_softirq(HRTIMER_SOFTIRQ);
}
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the [soft] next expiry */
- expires_next = hrtimer_update_next_event(cpu_base);
- /*
- * Store the new expiry value so the migration code can verify
- * against it.
- */
- cpu_base->expires_next = expires_next;
- cpu_base->in_hrtirq = 0;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
- /* Reprogramming necessary ? */
- if (!tick_program_event(expires_next, 0)) {
- cpu_base->hang_detected = 0;
- return;
- }
-
/*
* The next timer was already expired due to:
* - tracing
* - long lasting callbacks
* - being scheduled away when running in a VM
*
- * We need to prevent that we loop forever in the hrtimer
- * interrupt routine. We give it 3 attempts to avoid
- * overreacting on some spurious event.
- *
- * Acquire base lock for updating the offsets and retrieving
- * the current time.
+ * We need to prevent that we loop forever in the hrtimer interrupt
+ * routine. We give it 3 attempts to avoid overreacting on some
+ * spurious event.
*/
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
now = hrtimer_update_base(cpu_base);
- cpu_base->nr_retries++;
- if (++retries < 3)
- goto retry;
- /*
- * Give the system a chance to do something else than looping
- * here. We stored the entry time, so we know exactly how long
- * we spent here. We schedule the next event this amount of
- * time away.
- */
- cpu_base->nr_hangs++;
- cpu_base->hang_detected = 1;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ cpu_base->hang_detected = false;
+ if (expires_next < now) {
+ if (++retries < 3)
+ goto retry;
+
+ delta = ktime_sub(now, entry_time);
+ cpu_base->max_hang_time = max_t(unsigned int, cpu_base->max_hang_time, delta);
+ cpu_base->nr_hangs++;
+ cpu_base->hang_detected = true;
+ }
- delta = ktime_sub(now, entry_time);
- if ((unsigned int)delta > cpu_base->max_hang_time)
- cpu_base->max_hang_time = (unsigned int) delta;
- /*
- * Limit it to a sensible value as we enforce a longer
- * delay. Give the CPU at least 100ms to catch up.
- */
- if (delta > 100 * NSEC_PER_MSEC)
- expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
- else
- expires_next = ktime_add(now, delta);
- tick_program_event(expires_next, 1);
- pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+ hrtimer_interrupt_rearm(cpu_base, expires_next);
+ raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
+
#endif /* !CONFIG_HIGH_RES_TIMERS */
/*
@@ -1999,7 +2170,7 @@ void hrtimer_run_queues(void)
if (!ktime_before(now, cpu_base->softirq_expires_next)) {
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->softirq_activated = 1;
+ cpu_base->softirq_activated = true;
raise_timer_softirq(HRTIMER_SOFTIRQ);
}
@@ -2012,8 +2183,7 @@ void hrtimer_run_queues(void)
*/
static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
- struct hrtimer_sleeper *t =
- container_of(timer, struct hrtimer_sleeper, timer);
+ struct hrtimer_sleeper *t = container_of(timer, struct hrtimer_sleeper, timer);
struct task_struct *task = t->task;
t->task = NULL;
@@ -2031,8 +2201,7 @@ static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
* Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
* to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
*/
-void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
- enum hrtimer_mode mode)
+void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl, enum hrtimer_mode mode)
{
/*
* Make the enqueue delivery mode check work on RT. If the sleeper
@@ -2048,8 +2217,8 @@ void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
}
EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
-static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
- clockid_t clock_id, enum hrtimer_mode mode)
+static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl, clockid_t clock_id,
+ enum hrtimer_mode mode)
{
/*
* On PREEMPT_RT enabled kernels hrtimers which are not explicitly
@@ -2085,8 +2254,8 @@ static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
* @clock_id: the clock to be used
* @mode: timer mode abs/rel
*/
-void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
- clockid_t clock_id, enum hrtimer_mode mode)
+void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl, clockid_t clock_id,
+ enum hrtimer_mode mode)
{
debug_setup_on_stack(&sl->timer, clock_id, mode);
__hrtimer_setup_sleeper(sl, clock_id, mode);
@@ -2159,12 +2328,11 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
return ret;
}
-long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
- const clockid_t clockid)
+long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid)
{
struct restart_block *restart;
struct hrtimer_sleeper t;
- int ret = 0;
+ int ret;
hrtimer_setup_sleeper_on_stack(&t, clockid, mode);
hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns);
@@ -2203,8 +2371,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
current->restart_block.fn = do_no_restart_syscall;
current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
current->restart_block.nanosleep.rmtp = rmtp;
- return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
- CLOCK_MONOTONIC);
+ return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
#endif
@@ -2212,7 +2379,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
#ifdef CONFIG_COMPAT_32BIT_TIME
SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
- struct old_timespec32 __user *, rmtp)
+ struct old_timespec32 __user *, rmtp)
{
struct timespec64 tu;
@@ -2225,8 +2392,7 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
current->restart_block.fn = do_no_restart_syscall;
current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
current->restart_block.nanosleep.compat_rmtp = rmtp;
- return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
- CLOCK_MONOTONIC);
+ return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
#endif
@@ -2236,14 +2402,13 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
int hrtimers_prepare_cpu(unsigned int cpu)
{
struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
- int i;
- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+ for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];
clock_b->cpu_base = cpu_base;
seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
- timerqueue_init_head(&clock_b->active);
+ timerqueue_linked_init_head(&clock_b->active);
}
cpu_base->cpu = cpu;
@@ -2257,13 +2422,14 @@ int hrtimers_cpu_starting(unsigned int cpu)
/* Clear out any left over state from a CPU down operation */
cpu_base->active_bases = 0;
- cpu_base->hres_active = 0;
- cpu_base->hang_detected = 0;
+ cpu_base->hres_active = false;
+ cpu_base->hang_detected = false;
cpu_base->next_timer = NULL;
cpu_base->softirq_next_timer = NULL;
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->online = 1;
+ cpu_base->softirq_activated = false;
+ cpu_base->online = true;
return 0;
}
@@ -2272,20 +2438,20 @@ int hrtimers_cpu_starting(unsigned int cpu)
static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
struct hrtimer_clock_base *new_base)
{
+ struct timerqueue_linked_node *node;
struct hrtimer *timer;
- struct timerqueue_node *node;
- while ((node = timerqueue_getnext(&old_base->active))) {
- timer = container_of(node, struct hrtimer, node);
+ while ((node = timerqueue_linked_first(&old_base->active))) {
+ timer = hrtimer_from_timerqueue_node(node);
BUG_ON(hrtimer_callback_running(timer));
- debug_deactivate(timer);
+ debug_hrtimer_deactivate(timer);
/*
* Mark it as ENQUEUED not INACTIVE otherwise the
* timer could be seen as !active and just vanish away
* under us on another CPU
*/
- __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+ __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, false);
timer->base = new_base;
/*
* Enqueue the timers on the new cpu. This does not
@@ -2295,13 +2461,13 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
* sort out already expired timers and reprogram the
* event device.
*/
- enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+ enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS, true);
}
}
int hrtimers_cpu_dying(unsigned int dying_cpu)
{
- int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+ int ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
struct hrtimer_cpu_base *old_base, *new_base;
old_base = this_cpu_ptr(&hrtimer_bases);
@@ -2314,16 +2480,14 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
raw_spin_lock(&old_base->lock);
raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
- migrate_hrtimer_list(&old_base->clock_base[i],
- &new_base->clock_base[i]);
- }
+ for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+ migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]);
/* Tell the other CPU to retrigger the next event */
smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
raw_spin_unlock(&new_base->lock);
- old_base->online = 0;
+ old_base->online = false;
raw_spin_unlock(&old_base->lock);
return 0;
diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c
index 9daf8c5d9687..1c954f330dfe 100644
--- a/kernel/time/jiffies.c
+++ b/kernel/time/jiffies.c
@@ -32,7 +32,6 @@ static u64 jiffies_read(struct clocksource *cs)
static struct clocksource clocksource_jiffies = {
.name = "jiffies",
.rating = 1, /* lowest valid rating*/
- .uncertainty_margin = 32 * NSEC_PER_MSEC,
.read = jiffies_read,
.mask = CLOCKSOURCE_MASK(32),
.mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 413e2389f0a5..9331e1614124 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1092,7 +1092,7 @@ void exit_itimers(struct task_struct *tsk)
}
/*
- * There should be no timers on the ignored list. itimer_delete() has
+ * There should be no timers on the ignored list. posix_timer_delete() has
* mopped them up.
*/
if (!WARN_ON_ONCE(!hlist_empty(&tsk->signal->ignored_posix_timers)))
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index a88b72b0f35e..51f6a1032c83 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -78,7 +78,6 @@ static struct clock_event_device ce_broadcast_hrtimer = {
.set_state_shutdown = bc_shutdown,
.set_next_ktime = bc_set_next,
.features = CLOCK_EVT_FEAT_ONESHOT |
- CLOCK_EVT_FEAT_KTIME |
CLOCK_EVT_FEAT_HRTIMER,
.rating = 0,
.bound_on = -1,
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f63c65881364..7e57fa31ee26 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -76,8 +76,10 @@ const struct clock_event_device *tick_get_wakeup_device(int cpu)
*/
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
- if (bc)
+ if (bc) {
+ bc->next_event_forced = 0;
tick_setup_periodic(bc, 1);
+ }
}
/*
@@ -403,6 +405,7 @@ static void tick_handle_periodic_broadcast(struct clock_event_device *dev)
bool bc_local;
raw_spin_lock(&tick_broadcast_lock);
+ tick_broadcast_device.evtdev->next_event_forced = 0;
/* Handle spurious interrupts gracefully */
if (clockevent_state_shutdown(tick_broadcast_device.evtdev)) {
@@ -696,6 +699,7 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
raw_spin_lock(&tick_broadcast_lock);
dev->next_event = KTIME_MAX;
+ tick_broadcast_device.evtdev->next_event_forced = 0;
next_event = KTIME_MAX;
cpumask_clear(tmpmask);
now = ktime_get();
@@ -1063,6 +1067,7 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
bc->event_handler = tick_handle_oneshot_broadcast;
+ bc->next_event_forced = 0;
bc->next_event = KTIME_MAX;
/*
@@ -1175,6 +1180,7 @@ void hotplug_cpu__broadcast_tick_pull(int deadcpu)
}
/* This moves the broadcast assignment to this CPU: */
+ bc->next_event_forced = 0;
clockevents_program_event(bc, bc->next_event, 1);
}
raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d305d8521896..6a9198a4279b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -110,6 +110,7 @@ void tick_handle_periodic(struct clock_event_device *dev)
int cpu = smp_processor_id();
ktime_t next = dev->next_event;
+ dev->next_event_forced = 0;
tick_periodic(cpu);
/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7907fadd63f..cbbb87a0c6e7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -345,7 +345,7 @@ static bool check_tick_dependency(atomic_t *dep)
int val = atomic_read(dep);
if (likely(!tracepoint_enabled(tick_stop)))
- return !val;
+ return !!val;
if (val & TICK_DEP_MASK_POSIX_TIMER) {
trace_tick_stop(0, TICK_DEP_MASK_POSIX_TIMER);
@@ -864,19 +864,32 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+/* Simplified variant of hrtimer_forward_now() */
+static ktime_t tick_forward_now(ktime_t expires, ktime_t now)
+{
+ ktime_t delta = now - expires;
+
+ if (likely(delta < TICK_NSEC))
+ return expires + TICK_NSEC;
+
+ expires += TICK_NSEC * ktime_divns(delta, TICK_NSEC);
+ if (expires > now)
+ return expires;
+ return expires + TICK_NSEC;
+}
+
static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
{
- hrtimer_cancel(&ts->sched_timer);
- hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
+ ktime_t expires = ts->last_tick;
- /* Forward the time to expire in the future */
- hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+ if (now >= expires)
+ expires = tick_forward_now(expires, now);
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
- hrtimer_start_expires(&ts->sched_timer,
- HRTIMER_MODE_ABS_PINNED_HARD);
+ hrtimer_start(&ts->sched_timer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
} else {
- tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
+ hrtimer_set_expires(&ts->sched_timer, expires);
+ tick_program_event(expires, 1);
}
/*
@@ -1513,6 +1526,7 @@ static void tick_nohz_lowres_handler(struct clock_event_device *dev)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
if (likely(tick_nohz_handler(&ts->sched_timer) == HRTIMER_RESTART))
tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c07e562ee4c1..c493a4010305 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -3,34 +3,30 @@
* Kernel timekeeping code and accessor functions. Based on code from
* timer.c, moved in commit 8524070b7982.
*/
-#include <linux/timekeeper_internal.h>
-#include <linux/module.h>
-#include <linux/interrupt.h>
+#include <linux/audit.h>
+#include <linux/clocksource.h>
+#include <linux/compiler.h>
+#include <linux/jiffies.h>
#include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/init.h>
-#include <linux/mm.h>
+#include <linux/module.h>
#include <linux/nmi.h>
-#include <linux/sched.h>
-#include <linux/sched/loadavg.h>
+#include <linux/pvclock_gtod.h>
+#include <linux/random.h>
#include <linux/sched/clock.h>
+#include <linux/sched/loadavg.h>
+#include <linux/static_key.h>
+#include <linux/stop_machine.h>
#include <linux/syscore_ops.h>
-#include <linux/clocksource.h>
-#include <linux/jiffies.h>
+#include <linux/tick.h>
#include <linux/time.h>
#include <linux/timex.h>
-#include <linux/tick.h>
-#include <linux/stop_machine.h>
-#include <linux/pvclock_gtod.h>
-#include <linux/compiler.h>
-#include <linux/audit.h>
-#include <linux/random.h>
+#include <linux/timekeeper_internal.h>
#include <vdso/auxclock.h>
#include "tick-internal.h"
-#include "ntp_internal.h"
#include "timekeeping_internal.h"
+#include "ntp_internal.h"
#define TK_CLEAR_NTP (1 << 0)
#define TK_CLOCK_WAS_SET (1 << 1)
@@ -275,6 +271,11 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
}
+#ifdef CONFIG_ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+#include <asm/clock_inlined.h>
+
+static DEFINE_STATIC_KEY_FALSE(clocksource_read_inlined);
+
/*
* tk_clock_read - atomic clocksource read() helper
*
@@ -288,12 +289,35 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
* a read of the fast-timekeeper tkrs (which is protected by its own locking
* and update logic).
*/
-static inline u64 tk_clock_read(const struct tk_read_base *tkr)
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
+{
+ struct clocksource *clock = READ_ONCE(tkr->clock);
+
+ if (static_branch_likely(&clocksource_read_inlined))
+ return arch_inlined_clocksource_read(clock);
+
+ return clock->read(clock);
+}
+
+static inline void clocksource_disable_inline_read(void)
+{
+ static_branch_disable(&clocksource_read_inlined);
+}
+
+static inline void clocksource_enable_inline_read(void)
+{
+ static_branch_enable(&clocksource_read_inlined);
+}
+#else
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
{
struct clocksource *clock = READ_ONCE(tkr->clock);
return clock->read(clock);
}
+static inline void clocksource_disable_inline_read(void) { }
+static inline void clocksource_enable_inline_read(void) { }
+#endif
/**
* tk_setup_internals - Set up internals to use clocksource clock.
@@ -367,6 +391,27 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
tk->tkr_raw.mult = clock->mult;
tk->ntp_err_mult = 0;
tk->skip_second_overflow = 0;
+
+ tk->cs_id = clock->id;
+
+ /* Coupled clockevent data */
+ if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+ clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+ /*
+ * Aim for a one hour maximum delta and use KHz to handle
+ * clocksources with a frequency above 4GHz correctly as
+ * the frequency argument of clocks_calc_mult_shift() is u32.
+ */
+ clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+ NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+ /*
+ * Initialize the conversion limit as the previous clocksource
+ * might have the same shift/mult pair so the quick check in
+ * tk_update_ns_to_cyc() fails to update it after a clocksource
+ * change, leaving it effectively zero.
+ */
+ tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
+ }
}
/* Timekeeper helper functions. */
@@ -375,7 +420,7 @@ static noinline u64 delta_to_ns_safe(const struct tk_read_base *tkr, u64 delta)
return mul_u64_u32_add_u64_shr(delta, tkr->mult, tkr->xtime_nsec, tkr->shift);
}
-static inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
+static __always_inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
{
/* Calculate the delta since the last update_wall_time() */
u64 mask = tkr->mask, delta = (cycles - tkr->cycle_last) & mask;
@@ -696,6 +741,36 @@ static inline void tk_update_ktime_data(struct timekeeper *tk)
tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
}
+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+ struct tk_read_base *tkrs = &tks->tkr_mono;
+ struct tk_read_base *tkrc = &tkc->tkr_mono;
+ unsigned int shift;
+
+ if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+ !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+ return;
+
+ if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+ return;
+ /*
+ * The conversion math is simple:
+ *
+ * CS::MULT (1 << NS_TO_CYC_SHIFT)
+ * --------------- = ----------------------
+ * (1 << CS:SHIFT) NS_TO_CYC_MULT
+ *
+ * Ergo:
+ *
+ * NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+ *
+ * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+ */
+ shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+ tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+ tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
/*
* Restore the shadow timekeeper from the real timekeeper.
*/
@@ -730,6 +805,7 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;
if (tk->id == TIMEKEEPER_CORE) {
+ tk_update_ns_to_cyc(tk, &tkd->timekeeper);
update_vsyscall(tk);
update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
@@ -784,6 +860,71 @@ static void timekeeping_forward_now(struct timekeeper *tk)
tk_update_coarse_nsecs(tk);
}
+/*
+ * ktime_expiry_to_cycles - Convert an expiry time to clocksource cycles
+ * @id: Clocksource ID which is required for validity
+ * @expires_ns: Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles: Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to an absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+ struct timekeeper *tk = &tk_core.timekeeper;
+ struct tk_read_base *tkrm = &tk->tkr_mono;
+ ktime_t base_ns, delta_ns, max_ns;
+ u64 base_cycles, delta_cycles;
+ unsigned int seq;
+ u32 mult, shift;
+
+ /*
+ * Racy check to avoid the seqcount overhead when ID does not match. If
+ * the relevant clocksource is installed concurrently, then this will
+ * just delay the switch over to this mechanism until the next event is
+ * programmed. If the ID is not matching the clock events code will use
+ * the regular relative set_next_event() callback as before.
+ */
+ if (data_race(tk->cs_id) != id)
+ return false;
+
+ do {
+ seq = read_seqcount_begin(&tk_core.seq);
+
+ if (tk->cs_id != id)
+ return false;
+
+ base_cycles = tkrm->cycle_last;
+ base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+ mult = tk->cs_ns_to_cyc_mult;
+ shift = tk->cs_ns_to_cyc_shift;
+ max_ns = tk->cs_ns_to_cyc_maxns;
+
+ } while (read_seqcount_retry(&tk_core.seq, seq));
+
+ /* Prevent negative deltas and multiplication overflows */
+ delta_ns = min(expires_ns - base_ns, max_ns);
+ delta_ns = max(delta_ns, 0);
+
+ /* Convert to cycles */
+ delta_cycles = ((u64)delta_ns * mult) >> shift;
+ *cycles = base_cycles + delta_cycles;
+ return true;
+}
+
/**
* ktime_get_real_ts64 - Returns the time of day in a timespec64.
* @ts: pointer to the timespec to be set
@@ -848,7 +989,7 @@ u32 ktime_get_resolution_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_resolution_ns);
-static ktime_t *offsets[TK_OFFS_MAX] = {
+static const ktime_t *const offsets[TK_OFFS_MAX] = {
[TK_OFFS_REAL] = &tk_core.timekeeper.offs_real,
[TK_OFFS_BOOT] = &tk_core.timekeeper.offs_boot,
[TK_OFFS_TAI] = &tk_core.timekeeper.offs_tai,
@@ -857,8 +998,9 @@ static ktime_t *offsets[TK_OFFS_MAX] = {
ktime_t ktime_get_with_offset(enum tk_offsets offs)
{
struct timekeeper *tk = &tk_core.timekeeper;
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
- ktime_t base, *offset = offsets[offs];
+ ktime_t base;
u64 nsecs;
WARN_ON(timekeeping_suspended);
@@ -878,8 +1020,9 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset);
ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
{
struct timekeeper *tk = &tk_core.timekeeper;
- ktime_t base, *offset = offsets[offs];
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
+ ktime_t base;
u64 nsecs;
WARN_ON(timekeeping_suspended);
@@ -902,7 +1045,7 @@ EXPORT_SYMBOL_GPL(ktime_get_coarse_with_offset);
*/
ktime_t ktime_mono_to_any(ktime_t tmono, enum tk_offsets offs)
{
- ktime_t *offset = offsets[offs];
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
ktime_t tconv;
@@ -1631,7 +1774,19 @@ int timekeeping_notify(struct clocksource *clock)
if (tk->tkr_mono.clock == clock)
return 0;
+
+ /* Disable inlined reads across the clocksource switch */
+ clocksource_disable_inline_read();
+
stop_machine(change_clocksource, clock, NULL);
+
+ /*
+ * If the clocksource has been selected and supports inlined reads,
+ * enable the branch.
+ */
+ if (tk->tkr_mono.clock == clock && clock->flags & CLOCK_SOURCE_CAN_INLINE_READ)
+ clocksource_enable_inline_read();
+
tick_clock_notify();
return tk->tkr_mono.clock == clock ? 0 : -1;
}
@@ -2834,7 +2989,7 @@ static void tk_aux_update_clocksource(void)
continue;
timekeeping_forward_now(tks);
- tk_setup_internals(tks, tk_core.timekeeper.tkr_mono.clock);
+ tk_setup_internals(tks, tk_core.timekeeper.tkr_mono.clock);
timekeeping_update_from_shadow(tkd, TK_UPDATE_ALL);
}
}
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 543beba096c7..198d0608db74 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -9,6 +9,8 @@ extern ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq,
ktime_t *offs_boot,
ktime_t *offs_tai);
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
extern int timekeeping_valid_for_hres(void);
extern u64 timekeeping_max_deferment(void);
extern void timekeeping_warp_clock(void);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 7e1e3bde6b8b..04d928c21aba 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2319,6 +2319,7 @@ u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
*/
void timer_clear_idle(void)
{
+ int this_cpu = smp_processor_id();
/*
* We do this unlocked. The worst outcome is a remote pinned timer
* enqueue sending a pointless IPI, but taking the lock would just
@@ -2327,9 +2328,9 @@ void timer_clear_idle(void)
* path. Required for BASE_LOCAL only.
*/
__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
- if (tick_nohz_full_cpu(smp_processor_id()))
+ if (tick_nohz_full_cpu(this_cpu))
__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
- trace_timer_base_idle(false, smp_processor_id());
+ trace_timer_base_idle(false, this_cpu);
/* Activate without holding the timer_base->lock */
tmigr_cpu_activate();
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 488e47e96e93..427d7ddea3af 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -47,7 +47,7 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
int idx, u64 now)
{
SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
- SEQ_printf(m, ", S:%02x", timer->state);
+ SEQ_printf(m, ", S:%02x", timer->is_queued);
SEQ_printf(m, "\n");
SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),
@@ -56,13 +56,11 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
}
-static void
-print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
- u64 now)
+static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
{
+ struct timerqueue_linked_node *curr;
struct hrtimer *timer, tmp;
unsigned long next = 0, i;
- struct timerqueue_node *curr;
unsigned long flags;
next_one:
@@ -72,13 +70,13 @@ print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
raw_spin_lock_irqsave(&base->cpu_base->lock, flags);
- curr = timerqueue_getnext(&base->active);
+ curr = timerqueue_linked_first(&base->active);
/*
* Crude but we have to do this O(N*N) thing, because
* we have to unlock the base when printing:
*/
while (curr && i < next) {
- curr = timerqueue_iterate_next(curr);
+ curr = timerqueue_linked_next(curr);
i++;
}
@@ -103,8 +101,8 @@ print_base(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
SEQ_printf(m, " .resolution: %u nsecs\n", hrtimer_resolution);
#ifdef CONFIG_HIGH_RES_TIMERS
- SEQ_printf(m, " .offset: %Lu nsecs\n",
- (unsigned long long) ktime_to_ns(base->offset));
+ SEQ_printf(m, " .offset: %Ld nsecs\n",
+ (long long) base->offset);
#endif
SEQ_printf(m, "active timers:\n");
print_active_timers(m, base, now + ktime_to_ns(base->offset));
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 8bb95b2a6fcf..39ac4eba0702 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -395,7 +395,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
n_u64++;
} else {
struct trace_print_flags __flags[] = {
- __def_gfpflag_names, {-1, NULL} };
+ __def_gfpflag_names };
char *space = (i == se->n_fields - 1 ? "" : " ");
print_synth_event_num_val(s, print_fmt,
@@ -408,7 +408,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
trace_seq_puts(s, " (");
trace_print_flags_seq(s, "|",
entry->fields[n_u64].as_u64,
- __flags);
+ __flags, ARRAY_SIZE(__flags));
trace_seq_putc(s, ')');
}
n_u64++;
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 1996d7aba038..96e2d22b4364 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -69,14 +69,15 @@ enum print_line_t trace_print_printk_msg_only(struct trace_iterator *iter)
const char *
trace_print_flags_seq(struct trace_seq *p, const char *delim,
unsigned long flags,
- const struct trace_print_flags *flag_array)
+ const struct trace_print_flags *flag_array,
+ size_t flag_array_size)
{
unsigned long mask;
const char *str;
const char *ret = trace_seq_buffer_ptr(p);
int i, first = 1;
- for (i = 0; flag_array[i].name && flags; i++) {
+ for (i = 0; i < flag_array_size && flags; i++) {
mask = flag_array[i].mask;
if ((flags & mask) != mask)
@@ -106,12 +107,13 @@ EXPORT_SYMBOL(trace_print_flags_seq);
const char *
trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
- const struct trace_print_flags *symbol_array)
+ const struct trace_print_flags *symbol_array,
+ size_t symbol_array_size)
{
int i;
const char *ret = trace_seq_buffer_ptr(p);
- for (i = 0; symbol_array[i].name; i++) {
+ for (i = 0; i < symbol_array_size; i++) {
if (val != symbol_array[i].mask)
continue;
@@ -133,14 +135,15 @@ EXPORT_SYMBOL(trace_print_symbols_seq);
const char *
trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
unsigned long long flags,
- const struct trace_print_flags_u64 *flag_array)
+ const struct trace_print_flags_u64 *flag_array,
+ size_t flag_array_size)
{
unsigned long long mask;
const char *str;
const char *ret = trace_seq_buffer_ptr(p);
int i, first = 1;
- for (i = 0; flag_array[i].name && flags; i++) {
+ for (i = 0; i < flag_array_size && flags; i++) {
mask = flag_array[i].mask;
if ((flags & mask) != mask)
@@ -170,12 +173,13 @@ EXPORT_SYMBOL(trace_print_flags_seq_u64);
const char *
trace_print_symbols_seq_u64(struct trace_seq *p, unsigned long long val,
- const struct trace_print_flags_u64 *symbol_array)
+ const struct trace_print_flags_u64 *symbol_array,
+ size_t symbol_array_size)
{
int i;
const char *ret = trace_seq_buffer_ptr(p);
- for (i = 0; symbol_array[i].name; i++) {
+ for (i = 0; i < symbol_array_size; i++) {
if (val != symbol_array[i].mask)
continue;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 37317b81fcda..8ad72e17d8eb 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -174,7 +174,6 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
{ O_NOFOLLOW, "O_NOFOLLOW" },
{ O_NOATIME, "O_NOATIME" },
{ O_CLOEXEC, "O_CLOEXEC" },
- { -1, NULL }
};
trace_seq_printf(s, "%s(", entry->name);
@@ -205,7 +204,7 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
trace_seq_puts(s, "O_RDONLY|");
}
- trace_print_flags_seq(s, "|", bits, __flags);
+ trace_print_flags_seq(s, "|", bits, __flags, ARRAY_SIZE(__flags));
/*
* trace_print_flags_seq() adds a '\0' to the
* buffer, but this needs to append more to the seq.
diff --git a/lib/rbtree.c b/lib/rbtree.c
index 18d42bcf4ec9..5790d6ecba4e 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -446,6 +446,23 @@ void rb_erase(struct rb_node *node, struct rb_root *root)
}
EXPORT_SYMBOL(rb_erase);
+bool rb_erase_linked(struct rb_node_linked *node, struct rb_root_linked *root)
+{
+ if (node->prev)
+ node->prev->next = node->next;
+ else
+ root->rb_leftmost = node->next;
+
+ if (node->next)
+ node->next->prev = node->prev;
+
+ rb_erase(&node->node, &root->rb_root);
+ RB_CLEAR_LINKED_NODE(node);
+
+ return !!root->rb_leftmost;
+}
+EXPORT_SYMBOL_GPL(rb_erase_linked);
+
/*
* Augmented rbtree manipulation functions.
*
diff --git a/lib/timerqueue.c b/lib/timerqueue.c
index cdb9c7658478..e2a1e08cb4bd 100644
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -82,3 +82,17 @@ struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node)
return container_of(next, struct timerqueue_node, node);
}
EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
+
+#define __node_2_tq_linked(_n) \
+ container_of(rb_entry((_n), struct rb_node_linked, node), struct timerqueue_linked_node, node)
+
+static __always_inline bool __tq_linked_less(struct rb_node *a, const struct rb_node *b)
+{
+ return __node_2_tq_linked(a)->expires < __node_2_tq_linked(b)->expires;
+}
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+ return rb_add_linked(&node->node, &head->rb_root, __tq_linked_less);
+}
+EXPORT_SYMBOL_GPL(timerqueue_linked_add);
diff --git a/scripts/gdb/linux/timerlist.py b/scripts/gdb/linux/timerlist.py
index ccc24d30de80..9fb3436a217c 100644
--- a/scripts/gdb/linux/timerlist.py
+++ b/scripts/gdb/linux/timerlist.py
@@ -20,7 +20,7 @@ def ktime_get():
We can't read the hardware timer itself to add any nanoseconds
that need to be added since we last stored the time in the
timekeeper. But this is probably good enough for debug purposes."""
- tk_core = gdb.parse_and_eval("&tk_core")
+ tk_core = gdb.parse_and_eval("&timekeeper_data[TIMEKEEPER_CORE]")
return tk_core['timekeeper']['tkr_mono']['base']