Re: [PATCH v3 1/5] perf cpumap: If the die_id is missing use socket/physical_package_id

From: Namhyung Kim
Date: Thu Dec 19 2024 - 00:32:22 EST


On Wed, Dec 18, 2024 at 09:42:27AM -0800, Ian Rogers wrote:
> On Wed, Dec 18, 2024 at 4:04 AM James Clark <james.clark@xxxxxxxxxx> wrote:
> >
> >
> >
> > On 16/12/2024 11:24 pm, Ian Rogers wrote:
> > > An error value for a missing die_id may be written into things like
> > > the cpu_topology_map. As the topology needs to be fully written out,
> > > including the die_id, to allow perf.data file features to be aligned
> > > we can't allow error values to be written out. Instead base the
> > > missing die_id value off of the socket/physical_package_id assuming
> > > they correlate 1:1.
> > >
> > > Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
> > > ---
> > > tools/perf/util/cpumap.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
> > > index 27094211edd8..d362272f8466 100644
> > > --- a/tools/perf/util/cpumap.c
> > > +++ b/tools/perf/util/cpumap.c
> > > @@ -283,7 +283,8 @@ int cpu__get_die_id(struct perf_cpu cpu)
> > > {
> > > int value, ret = cpu__get_topology_int(cpu.cpu, "die_id", &value);
> > >
> > > - return ret ?: value;
> > > + /* If die_id is missing fallback on using the socket/physical_package_id. */
> > > + return ret || value < 0 ? cpu__get_socket_id(cpu) : value;
> > > }
> > >
> > > struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
> >
> > Hi Ian,
> >
> > I sent a fix for the same or a similar problem here [1]. For this one
> > I'm not sure why we'd want to use the socket ID for die when it's always
> > been 0 for not present. I wonder if this change is mingling two things:
> > fixing the negative error value appearing and replacing die with socket ID.
> >
> > Personally I would prefer to keep the 0 to fix the error value, that way
> > nobody gets surprised by the change.
> >
> > Also it looks like cpu__get_cluster_id() can suffer from the same issue,
> > and if we do it this way we should drop these as they aren't valid anymore:
> >
> > /* There is no die_id on legacy system. */
> > if (die < 0)
> > die = 0;
>
> I think this breaks the assumption here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/expr.c?h=perf-tools-next#n244

Hmm.. I'm not sure how it worked before. The code is already there and
it just changed the condition from == -1 to < 0, right?

Thanks,
Namhyung


> hence using the number of packages. I think we need to make sure there
> is order otherwise we'll end up with "if no dies value present" or "if
> dies value is 0 ignore" everywhere. What is strange to me is that ARM
> may have the die_id present but have it contain the value -1.
>
> > And last minor comment this could do with a fixes: 05be17eed774
>
> Thanks,
> Ian
>
> > [1]:
> > https://lore.kernel.org/linux-perf-users/20241218115552.912517-1-james.clark@xxxxxxxxxx/T/#u
> >