Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling

From: Dmitry Vyukov
Date: Wed Jan 08 2025 - 03:35:26 EST


On Wed, 8 Jan 2025 at 09:24, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. For each second of wall-clock time, up to
> number-of-cores seconds of CPU time are available.
>
> Currently perf can only profile CPU time. Perf (and, to the best of my
> knowledge, all other existing profilers) cannot profile wall-clock
> time.
>
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where a wall-clock profile is needed:
> - optimizing build latency
> - optimizing server request latency
> - optimizing ML training/inference latency
> - optimizing running time of any command line program
>
> A CPU profile is useless for these use cases at best (if the user
> understands the difference), or misleading at worst (if the user tries to
> use the wrong profile for the job).
>
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
>
> Brief outline of the implementation:
> - add context switch collection during record
> - calculate number of threads running on CPUs (parallelism level)
>   during report
> - divide each sample weight by the parallelism level
>
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.
>
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.
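
A minimal sketch of the outline quoted above, in Python for brevity. This
is not the perf code: the event tuples and the names wallclock_profile and
to_percent are hypothetical, made up for illustration, and it assumes a
time-ordered stream of context-switch and sample events.

```python
from collections import defaultdict

def wallclock_profile(events):
    """Return (cpu_weights, wall_weights) keyed by symbol.

    events is a time-ordered list of hypothetical tuples:
      ("in", tid)              thread switched onto a CPU
      ("out", tid)             thread switched off a CPU
      ("sample", tid, symbol)  periodic CPU sample taken in that thread
    """
    running = set()              # threads currently on a CPU
    cpu = defaultdict(float)     # classic CPU-time weight: 1 per sample
    wall = defaultdict(float)    # wall-clock weight: 1/parallelism per sample
    for ev in events:
        kind = ev[0]
        if kind == "in":
            running.add(ev[1])
        elif kind == "out":
            running.discard(ev[1])
        elif kind == "sample":
            _, tid, symbol = ev
            running.add(tid)     # a sampled thread is running by definition
            parallelism = len(running)
            cpu[symbol] += 1.0
            wall[symbol] += 1.0 / parallelism
    return dict(cpu), dict(wall)

def to_percent(weights):
    """Normalize weights so that they sum to 100."""
    total = sum(weights.values())
    return {k: 100.0 * v / total for k, v in weights.items()}

# Thread 1 runs "link" alone for 2 samples, then threads 1-4 run
# "compile" together for 2 samples each.
events = [("in", 1), ("sample", 1, "link"), ("sample", 1, "link"),
          ("in", 2), ("in", 3), ("in", 4)]
events += [("sample", tid, "compile") for _ in range(2) for tid in (1, 2, 3, 4)]

cpu, wall = wallclock_profile(events)
print(to_percent(cpu))   # compile dominates CPU time
print(to_percent(wall))  # link and compile take equal wall-clock time
```

In this toy trace "compile" accounts for 80% of CPU time, but only 50% of
wall-clock time, because its 8 samples were taken at parallelism 4.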

Hi folks,

Am I missing something and this is possible/known already?

I understand this is a large change, and I am open to comments.
I've also uploaded it to gerrit if you prefer to review there:
https://linux-review.git.corp.google.com/c/linux/kernel/git/torvalds/linux/+/25608

You may also check out that branch and try it locally. It works on older kernels.

Which parts of this are testable within the current testing framework?
Also, how do I run the tests? I could not figure it out.

Btw, the profile example in the docs is from a real kernel build on my machine.
You can see how misleading the current profile is wrt latency.

Or you can see what takes time in the build of perf itself
(despite -j128, 73% of the time was spent with 1 running thread,
and only a few percent of the time was spent with high parallelism).

  Wallclock  Overhead  Parallelism / Command
-    73.64%     6.96%  1
   +    28.53%     2.70%  cc1
   +    17.93%     1.69%  python3
   +    10.79%     1.02%  ld
-     7.49%     1.42%  2
   +     4.26%     0.81%  cc1
   +     0.72%     0.14%  ld
   +     0.68%     0.13%  cc1plus
...
-     1.33%    15.74%  125
   +     1.23%    14.50%  cc1
   +     0.03%     0.33%  gcc
   +     0.03%     0.32%  sh
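
As a rough sanity check of how the two columns relate, assuming each
sample's weight is divided by the parallelism level as described in the
patch: the Wallclock share of a parallelism level should be proportional
to Overhead / parallelism. A small Python sketch using only the three
top-level rows shown above (the helper name is made up for illustration):

```python
# (parallelism, Wallclock %, Overhead %) copied from the report above
rows = [(1, 73.64, 6.96), (2, 7.49, 1.42), (125, 1.33, 15.74)]

def wall_per_scaled_cpu(parallelism, wallclock, overhead):
    """Wallclock share divided by the parallelism-scaled CPU share."""
    return wallclock / (overhead / parallelism)

# If Wallclock is proportional to Overhead / parallelism, all rows give
# roughly the same value (the common normalization factor).
ratios = [wall_per_scaled_cpu(*row) for row in rows]
for (parallelism, _, _), ratio in zip(rows, ratios):
    print(f"parallelism {parallelism:>3}: {ratio:.2f}")
```

The three values agree to within rounding of the printed percentages,
which is why the rows at parallelism 125 carry a large Overhead but a
tiny Wallclock share.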