Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling

From: Dmitry Vyukov
Date: Mon Jan 13 2025 - 07:26:19 EST


On Wed, 8 Jan 2025 at 09:34, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Wed, 8 Jan 2025 at 09:24, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> >
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. For each second of wall-clock time there can be
> > up to number-of-cores seconds of CPU time.
> >
> > Currently perf can only profile CPU time. Perf (and, to the best of my
> > knowledge, all other existing profilers) cannot profile wall-clock
> > time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where wall-clock profile is needed:
> > - optimizing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > A CPU profile is at best useless for these use cases (if the user
> > understands the difference), and at worst misleading (if the user tries
> > to use the wrong profile for the job).
> >
> > This patch adds wall-clock and parallelism profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> > - add context switch collection during record
> > - calculate number of threads running on CPUs (parallelism level)
> >   during report
> > - divide each sample weight by the parallelism level
> > This effectively models taking 1 sample per unit of wall-clock time,
> > rather than 1 sample per unit of CPU time.
> >
> > The feature is added on an equal footing with the existing CPU profiling
> > rather than as a separate mode enabled with special flags. The reasoning
> > is that users may not understand the problem, or the meaning of the
> > numbers they are seeing, in the first place, so they won't even realize
> > that they may need a different profiling mode. When they are presented
> > with two sets of different numbers, they should start asking questions.
>
> Hi folks,
>
> Am I missing something, and is this already possible/known?
>
> I understand this is a large change, and I am open to comments.
> I've also uploaded it to gerrit if you prefer to review there:
> https://linux-review.git.corp.google.com/c/linux/kernel/git/torvalds/linux/+/25608
>
> You may also check out that branch and try it locally. It works on older kernels.
>
> Which parts of this are testable within the current testing framework?
> Also, how do I run the tests? I could not figure it out.
>
> Btw, the profile example in the docs is from a real kernel build on my machine.
> You can see how misleading the current profile is wrt latency.
>
> Or you can see what takes time in the perf build itself
> (despite -j128, 73% of the time was spent with 1 running thread, and
> only a few percent of the time was spent with high parallelism).
>
>   Wallclock  Overhead  Parallelism / Command
> -   73.64%     6.96%   1
>   +   28.53%     2.70%   cc1
>   +   17.93%     1.69%   python3
>   +   10.79%     1.02%   ld
> -    7.49%     1.42%   2
>   +    4.26%     0.81%   cc1
>   +    0.72%     0.14%   ld
>   +    0.68%     0.13%   cc1plus
>   ...
> -    1.33%    15.74%   125
>   +    1.23%    14.50%   cc1
>   +    0.03%     0.33%   gcc
>   +    0.03%     0.32%   sh
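
A quick sanity check on the table above: if each sample's weight is divided by the parallelism level, a row's Wallclock share should be proportional to its Overhead divided by its parallelism, with the same constant for every row. Below is a minimal Python sketch that checks this using only the three top-level rows quoted above. The function name is made up for illustration; this is not code from the patch.

```python
# Top-level rows from the quoted profile:
# (parallelism level, wall-clock %, CPU overhead %).
ROWS = [
    (1, 73.64, 6.96),
    (2, 7.49, 1.42),
    (125, 1.33, 15.74),
]


def wallclock_per_scaled_overhead(parallelism, wallclock_pct, overhead_pct):
    """Ratio of the wall-clock share to (CPU overhead / parallelism)."""
    return wallclock_pct / (overhead_pct / parallelism)


ratios = [wallclock_per_scaled_overhead(*row) for row in ROWS]

for (parallelism, _, _), ratio in zip(ROWS, ratios):
    print(f"parallelism {parallelism:>3}: ratio {ratio:.2f}")
```

All three ratios land between 10.5 and 10.6. That is what the weighting predicts, up to rounding in the displayed percentages.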


> [PATCH] tools/perf: Add wall-clock and parallelism profiling

Note to self: the subject prefix needs to change to "perf report:".
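
For anyone who wants the weighting from the outline quoted above in concrete form, here is a rough Python sketch of the idea. It is only a model: the event tuples ("in", "out", "sample") and all function names are hypothetical and do not match perf's record format or the patch's C code. It tracks which threads are on a CPU from context switches, then credits each sample with weight / parallelism.

```python
from collections import defaultdict


def weigh_samples(events):
    """Return (cpu, wall) weight per symbol from a time-ordered event list.

    Hypothetical event shapes, for illustration only:
      ("in", tid)                      thread switched onto a CPU
      ("out", tid)                     thread switched off a CPU
      ("sample", tid, symbol, weight)  profiling sample taken in `symbol`
    """
    running = set()
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for event in events:
        kind, tid = event[0], event[1]
        if kind == "in":
            running.add(tid)
        elif kind == "out":
            running.discard(tid)
        elif kind == "sample":
            _, _, symbol, weight = event
            # The sampled thread is on a CPU by definition, so count it
            # even if its switch-in event was not observed.
            parallelism = len(running | {tid})
            cpu[symbol] += weight
            wall[symbol] += weight / parallelism
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return dict(cpu), dict(wall)


def as_percent(weights):
    """Normalize a {symbol: weight} mapping to percentages."""
    total = sum(weights.values())
    return {name: 100.0 * value / total for name, value in weights.items()}


# Toy trace: one thread links alone, then four threads compile in parallel.
events = [
    ("in", 1),
    ("sample", 1, "link", 1.0),
    ("sample", 1, "link", 1.0),
    ("in", 2), ("in", 3), ("in", 4),
    ("sample", 1, "compile", 1.0),
    ("sample", 2, "compile", 1.0),
    ("sample", 3, "compile", 1.0),
    ("sample", 4, "compile", 1.0),
]

cpu, wall = weigh_samples(events)
cpu_pct, wall_pct = as_percent(cpu), as_percent(wall)
```

In this toy trace the serial "link" phase is one third of the CPU weight but two thirds of the wall-clock weight. The kernel build profile quoted above shows the same kind of inversion.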