Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events

From: Ian Rogers

Date: Wed Apr 29 2026 - 18:15:34 EST


On Wed, Apr 29, 2026 at 3:09 PM Calvin Owens <calvin@xxxxxxxxxx> wrote:
>
> On Wednesday 04/29 at 20:38 +0000, sashiko-bot@xxxxxxxxxx wrote:
> > > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> > > if (avg_len <= max_len)
> > > return;
> > >
> > > + /*
> > > + * Very infrequent events like the perf counter hard watchdog
> > > + * can trigger spurious throttling: skip throttling if the prior
> > > + * NMI got here more than one second before this NMI began. But
> > > + * never skip throttling if NMIs are nesting, or if any NMI runs
> > > + * for longer than one second.
> > > + */
> > > + now = local_clock();
> > > + last = __this_cpu_read(last_throttle_clock);
> > > + if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> >
> > On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> > value in NMI context?
> >
> > For architectures like 32-bit x86, the generic fallback implementation
> > achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> > nested NMI interrupts the sequence halfway through reading or writing the
> > 32-bit halves of the u64, could this result in a torn read or write?
> >
> > A corrupted timestamp could cause a massive wrap-around in the time gap
> > calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> > silently disabling PMU throttling for all events on that CPU.
> >
> > > + now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
>
> If this is a problem, isn't it also a problem for the 64-bit store after
> updating the EWMA just above this?
>
> I guess last_throttle_clock could be a u32 and use the low clock bits,
> that's sufficient with the one second limit... but I would appreciate a
> real human opinion :)
>
> > The comment indicates the intent is to skip throttling if the prior NMI
> > occurred more than one second before this NMI began.
>
> My comment is not very clear, I suppose. It should say:
>
> "...skip throttling if the prior attempt to throttle occurred more
> than one second ago, and the current NMI runtime was less than one
> second. But never skip throttling if NMIs are nesting."
>
> > However, local_clock() is sampled at the end of the current NMI, so now
> > includes the current NMI's duration.
> >
> > If an NMI takes a very long time (e.g., 900ms) but started only 200ms after
> > the previous NMI ended, the true gap before it began is 200ms. Because now
> > includes the 900ms duration, now - last evaluates to 1100ms, which bypasses
> > the throttling.
> >
> > Could this artificially inflate the time gap for long-running NMIs,
> > directly defeating the throttling mechanism?
>
> I think it's just saying "the code doesn't do what the comment says",
> which is true. But what it is describing is sort of addressed below too.
>
> > Additionally, is it possible for user-defined PMU events with expensive
> > overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> > this hardcoded frequency bypass?
>
> No, I don't think ebpf programs run in NMI context?

Unfortunately Sashiko isn't currently reading replied emails, but you
could copy-paste this feedback into an AI.

BPF programs can run in the NMI context, which means they can
introduce delays. I'm not familiar with all the side-effects of and
restrictions on this.

Thanks,
Ian

> > If an event triggers slower than 1Hz but takes a very long time to execute
> > (e.g., 999ms), both now - last > NSEC_PER_SEC and sample_len_ns <
> > NSEC_PER_SEC will be true.
> >
> > Such an event would bypass the moving average calculation and never be
> > throttled, allowing it to consume nearly 100% of the CPU while it runs and
> > violating the default 25% CPU time limit set by
> > sysctl_perf_cpu_time_max_percent.
>
> No matter where the limit is defined, NMIs firing at just under that
> limit will skip throttling. But I guess it could be something like:
>
> now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_MSEC
>
> ...to sort of costrain the worst case.
>
> But this all feels really arbitrary. I thought about using the watchdog
> interval sysctl minus some padding as the upper bound, but that seemed
> unnecessarily complicated.
>
> Maybe this second patch is just more trouble than its worth...
> especially if, as Andi noted earlier, the PMU watchdog is not long for
> this world...
>
> Thanks,
> Calvin
>
> > --
> > Sashiko AI review · https://sashiko.dev/#/patchset/cover.1777483573.git.calvin@xxxxxxxxxx?part=2