[PATCH 2/2] perf: Don't throttle based on NMI watchdog events

From: Calvin Owens

Date: Tue Mar 31 2026 - 11:29:45 EST


The throttling logic in perf_sample_event_took() assumes the NMI is
running at the maximum allowed sample rate. While this makes sense most
of the time, it wildly overestimates the runtime of the NMI for the perf
hardware watchdog:

# bpftrace -e 'kprobe:perf_sample_event_took { \
printf("%s: cpu=%02d time_taken=%dns\n", \
strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
03:12:13.087003: cpu=00 time_taken=3190ns
03:12:13.486789: cpu=01 time_taken=2918ns
03:12:18.075288: cpu=03 time_taken=3308ns
03:12:19.797207: cpu=02 time_taken=2581ns
03:12:23.110317: cpu=00 time_taken=2823ns
03:12:23.510308: cpu=01 time_taken=2943ns
03:12:29.229348: cpu=03 time_taken=3669ns
03:12:31.656306: cpu=02 time_taken=3262ns

The NMI for the watchdog runs for 2-4us every ten seconds, but the
math done in perf_sample_event_took() concludes it is running for
200-400ms every second!

When the watchdog is the only PMU event running, the moving average can
take minutes to hours of samples to accumulate to something near the
real mean, so the same little "litany" of sample rate throttles plays
out every time Linux boots with the perf hardware watchdog enabled:

perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000

This serves no purpose: it doesn't actually affect the runtime of the
watchdog NMI at all. It confuses users, because it suggests their
machine is spinning its wheels in interrupts when it isn't.

Because the watchdog NMI is so infrequent, we can avoid throttling it by
making the throttling a two-step process: load and update a timestamp
whenever we think we need to throttle, and only actually proceed to
throttle if the last time that happened was less than one second ago.

This is inelegant, but it avoids touching the hot path and preserves
current throttling behavior for real PMU use, at the cost of delaying
the throttling by a single NMI.

Signed-off-by: Calvin Owens <calvin@xxxxxxxxxx>
---
kernel/events/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..0f7a7e912f55 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
*/
#define NR_ACCUMULATED_SAMPLES 128
static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);

static u64 __report_avg;
static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
u64 max_len = READ_ONCE(perf_sample_allowed_ns);
u64 running_len;
u64 avg_len;
+ u64 delta;
+ u64 now;
u32 max;

if (max_len == 0)
@@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
if (avg_len <= max_len)
return;

+ /*
+ * Very infrequent events like the perf counter hard watchdog
+ * can trigger spurious throttling: skip throttling if the prior
+ * NMI got here more than one second before this NMI began.
+ */
+ now = local_clock();
+ delta = now - __this_cpu_read(last_throttle_clock);
+ __this_cpu_write(last_throttle_clock, now);
+ if (delta - sample_len_ns > NSEC_PER_SEC)
+ return;
+
__report_avg = avg_len;
__report_allowed = max_len;

--
2.47.3