On Thu, Dec 19, 2024 at 09:20:39AM -0800, Ian Rogers wrote:>>>>>> Hmm.. I'm not sure how it worked before. The code is already there and
On Wed, Dec 18, 2024 at 9:32 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
On Wed, Dec 18, 2024 at 09:42:27AM -0800, Ian Rogers wrote:
On Wed, Dec 18, 2024 at 4:04 AM James Clark <james.clark@xxxxxxxxxx> wrote:
On 16/12/2024 11:24 pm, Ian Rogers wrote:
An error value for a missing die_id may be written into things like
the cpu_topology_map. As the topology needs to be fully written out,
including the die_id, to allow perf.data file features to be aligned
we can't allow error values to be written out. Instead base the
missing die_id value off of the socket/physical_package_id assuming
they correlate 1:1.
Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
---
tools/perf/util/cpumap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 27094211edd8..d362272f8466 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -283,7 +283,8 @@ int cpu__get_die_id(struct perf_cpu cpu)
{
int value, ret = cpu__get_topology_int(cpu.cpu, "die_id", &value);
- return ret ?: value;
+ /* If die_id is missing fallback on using the socket/physical_package_id. */
+ return ret || value < 0 ? cpu__get_socket_id(cpu) : value;
}
struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
Hi Ian,
I sent a fix for the same or a similar problem here [1]. For this one
I'm not sure why we'd want to use the socket ID for die when it's always
been 0 for not present. I wonder if this change is mingling two things:
fixing the negative error value appearing and replacing die with socket ID.
Personally I would prefer to keep the 0 to fix the error value, that way
nobody gets surprised by the change.
Also it looks like cpu__get_cluster_id() can suffer from the same issue,
and if we do it this way we should drop these as they aren't valid anymore:
/* There is no die_id on legacy system. */
if (die < 0)
die = 0;
I think this breaks the assumption here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/expr.c?h=perf-tools-next#n244
it just changed the condition from == -1 to < 0, right?
You'd need to be testing on a multi-socket machine to see the issue.
If you had say a dual socket Ampere chip and the die_id was missing,
does it make sense for there to be two sockets/packages but only 1
die? I think it is best we assume 1 die per socket when the die_id is
missing, and to some crippled extent (because of the s390 workaround)
the expr test is doing the sanity check.
AFAICS die ID is always used with socket ID already so I guess it means
an ID inside a socket.
Thanks,
Namhyung