Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

From: Ashley, William
Date: Tue Dec 05 2023 - 13:34:02 EST


Mark,
Thanks for your time on this. Is there (any more of) a chance of getting a false second overflow signal with these changes? Especially for example when the counter is counting EL1, or when pseudo-nmi is enabled?


On 12/5/23, 12:15 PM, "Mark Rutland" <mark.rutland@xxxxxxx <mailto:mark.rutland@xxxxxxx>> wrote:


CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.






On Thu, Nov 30, 2023 at 12:23:13PM +0000, Mark Rutland wrote:
> On Thu, Nov 30, 2023 at 11:48:54AM +0000, Mark Rutland wrote:
> > On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> > > Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> > > intentionally count events that occur within the kernel?
> >
> > Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
> > doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
> > think the poorly-defined exclude_* bits are part of the problem here.
> >
> > Using your test as-is on my ThunderX2, I can reproduce the period being longer
> > than expected by concurrently running the following in a shell:
> >
> > while true; do
> > for C in $(seq 0 63); do
> > taskset -c -p $C ${TEST_PID_HERE};
> > done;
> > done > /dev/null
> >
> > ... resulting in:
> >
> > | [mark@gravadlaks:~]% ./counter-overflow
> > | Pid 20060 running with period 10000 tolerance 1000
> > | Signal #1: last: 0, cur: 10292, max diff: 0
> > | Signal #415330: delta of 19999 is outside 10000 +/- 1000
> > | Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
> > | Signal #489879: delta of 19998 is outside 10000 +/- 1000
> > | Signal #511842: delta of 20058 is outside 10000 +/- 1000
> > | Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999
> >
> > However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
> > not see any lost overflows after many minutes. On VHE hosts (like the
> > ThunderX2), the host kernel gets counted when perf_event_attr::exclude_host=0,
> > even if perf_event_attr::exclude_kernel=1 (which I agree is surprising), so I
> > think what's happening is the counters are counting in the host kernel, which
> > isn't what RR actually wants regardless.
>
> > I'll continue to look at what we can do kernel-side, but I reckon it's worth
> > having RR try the other exclude bits regardless, if that's possible? It would
> > be interesting to know whether that helps you under a hypervisor.
>
> Sorry, the above is wrong, and I do not recommend RR goes and changes its
> exclude_* settings.
>
> I had misread the logic in armv8pmu_set_event_filter(), but looking again
> that's saner than I thought it was, and what was actually happening in my
> testing is that exclude_host also filtered host EL0 (userspace), and so the
> test received *no* overflow signals.
>
> I'll get back to looking at how we can better capture the overflow when
> removing an event.


I've spent a couple of days on this, and I have a prototype at:


https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=perf/overflow <https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=perf/overflow>


The gist of that is:


* It's not viable to sample at start/stop time since we don't have all the
usual sample state (e.g. pt_regs), which means that we have to save/restore
the overflow status and rely on the regular interrupt handler to actually
sample after this is restored.


* Saving/restoring the overflow status requires reworking the way we manage the
sample period, separating the sample period from the period we program into
HW. This part looks fairly sane, but requires some rework of all the arm_pmu
drivers.


* We might need a couple of small changes to the perf core code (e.g. for the
IOCTLs that mess with the period), and this change is liable to affect other
architectures, so we'd need ot go audit that.


I reckon it's possible (with some testing and review) to get something like
this series merged, but I suspect it's too big for a stable backport.


Mark.