Re: perf 6.9-1 (archlinux) crashes during recording of cycles + raw_syscalls

From: Namhyung Kim
Date: Thu Jun 06 2024 - 18:20:21 EST


On Tue, Jun 04, 2024 at 04:02:08PM -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Jun 04, 2024 at 11:48:09AM -0700, Ian Rogers wrote:
> > On Tue, Jun 4, 2024 at 7:12 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
> > > Can you please try with the attached and perhaps provide your Tested-by?
>
> > > From ab355e2c6b4cf641a9fff7af38059cf69ac712d5 Mon Sep 17 00:00:00 2001
> > > From: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> > > Date: Tue, 4 Jun 2024 11:00:22 -0300
> > > Subject: [PATCH 1/1] Revert "perf record: Reduce memory for recording
> > > PERF_RECORD_LOST_SAMPLES event"
>
> > > This reverts commit 7d1405c71df21f6c394b8a885aa8a133f749fa22.
>
> > I think we should try to fight back reverts when possible. Reverts are
> > removing something somebody poured time and attention into. When a
>
> While in the development phase, yeah, but when we find a regression and
> the revert makes it go away, that is the way to go.
>
> The person who poured time on the development gets notified and can
> decide if/when to try again.
>
> Millian had to pour time to figure out why something stopped working,
> was kind enough to provide the output from multiple tools to help in
> fixing the problem and I had to do the bisect to figure out when the
> problem happened and to check if reverting it we would have the tool
> working again.
>
> If we try to fix this for v6.10 we may end up adding yet another bug, so
> the safe thing to do at this point is to do the revert.
>
> We can try improving this once again for v6.11.

I think I found a couple of problems with this issue. :(

1. perf_session__set_id_hdr_size() uses the first evsel in the session,
but I think it should pick the tracking event. I guess we assume
all events have the same set of sample_type bits wrt sample_id_all,
but I'm not sure that's correct.

2. With --call-graph dwarf, it seems to set unrelated sample type bits
in the attr like ADDR and DATA_SRC.

3. For tracepoint events, evsel__newtp_idx() sets a couple of sample
type bits regardless of the configuration. This includes RAW, TIME and
CPU. This one changes the format of the id headers.

4. PERF_RECORD_LOST_SAMPLES is for the sampling event, so it should
use the event's sample_type. But the event parsing looks up the
event using evlist->is_pos which is set for the first event.

5. I think we can remove some sample types (i.e. TID and CPU) from the
tracking event in most cases. ID(ENTIFIER) will be used for
LOST_SAMPLES and TIME is needed anyway. TID might be used for SWITCH
but others already contain necessary information in the type. I
wish we could add an id field to PERF_RECORD_LOST_SAMPLES and tid/pid
to PERF_RECORD_SWITCH.
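
To illustrate why points 1 and 3 matter, here is a minimal sketch (not
the perf code) of how the size of the sample_id trailer follows from the
sample_type bits when sample_id_all is set. The SKETCH_* names and
sketch_id_hdr_size() are made up for this example; the bit values are
copied from the perf_event UAPI header, and each selected field
(pid/tid, time, id, stream_id, cpu/res, identifier) takes 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit values as in include/uapi/linux/perf_event.h; names are local to this sketch. */
#define SKETCH_SAMPLE_IP         (1ULL << 0)
#define SKETCH_SAMPLE_TID        (1ULL << 1)
#define SKETCH_SAMPLE_TIME       (1ULL << 2)
#define SKETCH_SAMPLE_ID         (1ULL << 6)
#define SKETCH_SAMPLE_CPU        (1ULL << 7)
#define SKETCH_SAMPLE_PERIOD     (1ULL << 8)
#define SKETCH_SAMPLE_STREAM_ID  (1ULL << 9)
#define SKETCH_SAMPLE_RAW        (1ULL << 10)
#define SKETCH_SAMPLE_IDENTIFIER (1ULL << 16)

/*
 * Hypothetical helper: number of bytes of sample_id data appended to
 * non-sample records for an event with the given sample_type.
 * Only the six bits below contribute; every one of them adds 8 bytes.
 */
size_t sketch_id_hdr_size(uint64_t sample_type)
{
	const uint64_t id_bits = SKETCH_SAMPLE_TID | SKETCH_SAMPLE_TIME |
				 SKETCH_SAMPLE_ID | SKETCH_SAMPLE_STREAM_ID |
				 SKETCH_SAMPLE_CPU | SKETCH_SAMPLE_IDENTIFIER;
	uint64_t bits = sample_type & id_bits;
	size_t size = 0;

	/* Count the selected fields, clearing the lowest set bit each time. */
	while (bits) {
		size += sizeof(uint64_t);
		bits &= bits - 1;
	}
	return size;
}
```

With an assumed sample_type of IP|TID|TIME|PERIOD|IDENTIFIER for the
first event, this gives 24 bytes. Adding RAW and CPU on top of that, as
an assumed tracepoint configuration, gives 32 bytes, because CPU lands
in the trailer while RAW does not. So a header size taken from the
first evsel would not match records that carry the other layout.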

Thanks,
Namhyung

>
> > regression has occurred then I think we should add the regression case
> > as a test.
>
> Sure, I thought about that as well, will try and have one shell test
> with that, referring to this case, for v6.11.
>
> - Arnaldo