From my perspective, this is a worse solution as now we're sweeping an
issue under the rug and consuming CPU cycles handling IRQs we should not
be getting in the first place. While an overflow IRQ from the cmn should
not be high frequency, there is a non-zero chance in the future it could
be and this could lead to a very hard to debug performance issue instead
of the current problem, which is discovering we need to clean up better
from a noisy kernel message.
Kexec is not the only possible source of spurious IRQs. If they cause a
problem for this driver, that cannot be robustly addressed by trying to
rely on whatever software might happen to run before this driver.
Sure, I can agree with the assertion a spurious IRQ could come from
anywhere, in that case though, shouldn't the behavior still be to log
spurious IRQs as a warning instead of silently sinking them?
The driver as best I can grok currently is optimized to limit the amount
of register writes for the common use-case, which is setting and unsetting
events, so all the wiring for the PMU to feed events to the DTC is done up
front on load: DTC_CTL's DT_EN bit is set immediately during probe, as is
OVFL_INTR_EN. All the DN states and DTM PMU_CONFIG_PMU_EN is deferred
for when an event is actually set, and here we go through all of them
anyways for each event unless its bynodeid, so the expense of setting
events grows linearly with the mesh size anyways.
If arm_cmn_init_dtc() writing 0 to PMCR didn't stop the PMU then we've
got bigger problems, because that's how we expect to start and stop it
in normal operation. I'm not ruling out that some subtle bug in that
regard might exist, since I've still not yet had a chance to reproduce
and observe this behaviour on my board, but I've also not seen
sufficient evidence to suggest that that is the case either. (Now that
I'm looking closely, I think there *is* actually a small oversight for
the DTMs, but that would lead to different symptoms than you reported)
At least the writes to PMOVSR_CLR *did* clearly work, because you're
seeing the "nobody cared" message from the IRQ core rather than the
WARN_ON(!dtc->counters[i]) which would happen if a fresh overflow was
actually asserted. Currently I would expect to see up to 4 of those
messages since there can be up to 4 IRQs, but once those are all
requested, enabled, and "handled", all the spurious initially-latched
state should be cleared and any *new* overflows will be indicated in
PMOVSR. I don't see how a single IRQ could ever be unhandled more than
once anyway, if the first time disables it.
I do see 4 of these "nobody cared" messages in all the times I've
reproduced it, but saw no need to copy paste all of them in with the
original post.
Looking back over the code I see why more clearly your
assertion we only need to clear the DT_EN bit as the PMU is off at
the DTC with the PMCR set to 0 on init, but it is really hard to
see why that is with all the various places bits of configuration is done,
but it is still not easy to verify if unsetting that bit is sufficient to
not get into some odd corner cases.
Is there any argument against me taking another pass and try separating
out discovery, from a shared reset/initialization code path?