Re: [PATCH 0/5] perf sched: Introduce stats tool

From: Sapkal, Swapnil
Date: Wed Sep 18 2024 - 09:20:16 EST


Hi James,

On 9/17/2024 4:05 PM, James Clark wrote:


On 16/09/2024 17:47, Ravi Bangoria wrote:
MOTIVATION
----------

Existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched record`
has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
on a 2-socket, 128-core, 256-thread 3rd Generation EPYC Server), and it
generates a huge 56G perf.data file which perf takes ~137 minutes to prepare
and write to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
and generates samples on each tracepoint hit, `perf sched stats record` takes
a snapshot of the /proc/schedstat file before and after the workload, i.e.
there is almost zero interference with the workload run. Also, it takes very
little time to parse /proc/schedstat, convert it into perf samples and
save those samples into the perf.data file. The resulting perf.data file is
much smaller, so overall `perf sched stats record` is much more lightweight
compared to `perf sched record`.
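
Conceptually, the record step amounts to snapshotting /proc/schedstat around
the workload run and then diffing the two snapshots field by field; a rough
manual equivalent (illustrative only, the tool additionally converts the data
into perf samples) would be:

   # cat /proc/schedstat > schedstat.before
   # <run the workload>
   # cat /proc/schedstat > schedstat.after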

We, internally at AMD, have been using this (a variant of it, known as
"sched-scoreboard" [2]) and have found it very useful for analysing the impact
of scheduler code changes [3][4].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

   # perf sched stats record
   # perf sched stats report

Note: Although the `perf sched stats` tool supports the workload profiling
syntax (i.e. -- <workload>), the recorded profile is still system-wide since
/proc/schedstat is a system-wide file.
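
For example, snapshots can be taken around a workload run like this (the
workload below is only illustrative; any command given after -- works the
same way):

   # perf sched stats record -- perf bench sched messaging
   # perf sched stats report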

HOW TO INTERPRET THE REPORT
---------------------------

The `perf sched stats report` starts with the total time profiling was active,
in jiffies:

----------------------------------------------------------------------------------------------------
   Time elapsed (in jiffies)                                   : 24537
----------------------------------------------------------------------------------------------------

Next come the CPU scheduling statistics. These are simple diffs of the
/proc/schedstat CPU lines, along with a description. The report also
prints percentages relative to a base stat.

In the example below, schedule() left CPU0 idle 98.19% of the time,
16.54% of all try_to_wake_up() calls were to wake up the local CPU, and the
total waittime by tasks on CPU0 is 0.49% of the total runtime by tasks on the
same CPU.

----------------------------------------------------------------------------------------------------
   CPU 0
----------------------------------------------------------------------------------------------------
   sched_yield() count :           0
   Legacy counter can be ignored :           0
   schedule() called                                           : 17138
   schedule() left the processor idle                          : 16827 ( 98.19% )
   try_to_wake_up() was called :         508
   try_to_wake_up() was called to wake up the local cpu :          84 ( 16.54% )
   total runtime by tasks on this processor (in jiffies)       : 2408959243
   total waittime by tasks on this processor (in jiffies)      : 11731825 ( 0.49% )
   total timeslices run on this cpu :         311
----------------------------------------------------------------------------------------------------
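
These percentages follow directly from the counters shown above, e.g.:

   16827 / 17138         = 98.19%  (schedule() calls that left CPU0 idle)
   84 / 508              = 16.54%  (wakeups targeting the local CPU)
   11731825 / 2408959243 =  0.49%  (waittime relative to runtime)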

Next are the load balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE`, ...), the scheduler computes statistics under
the following three categories:

   1) Idle Load Balance: Load balancing performed on behalf of a long
                         idling CPU by some other CPU.
   2) Busy Load Balance: Load balancing performed when the CPU was busy.
   3) New Idle Balance : Load balancing performed when a CPU just became
                         idle.

Under each of these three categories, the sched stats report provides
different load balancing statistics. Along with the direct stats, the
report also contains derived metrics prefixed with *. Example:

----------------------------------------------------------------------------------------------------
   CPU 0 DOMAIN SMT CPUS <0, 64>
   ----------------------------------------- <Category idle> ------------------------------------------
   load_balance() count on cpu idle :          50   $      490.74 $
   load_balance() found balanced on cpu idle :          42   $      584.21 $
   load_balance() move task failed on cpu idle :           8   $     3067.12 $
   imbalance sum on cpu idle :           8
   pull_task() count on cpu idle :           0
   pull_task() when target task was cache-hot on cpu idle :           0
   load_balance() failed to find busier queue on cpu idle :           0   $        0.00 $
   load_balance() failed to find busier group on cpu idle :          42   $      584.21 $
   *load_balance() success count on cpu idle :           0
   *avg task pulled per successful lb attempt (cpu idle) :        0.00
   ----------------------------------------- <Category busy> ------------------------------------------
   load_balance() count on cpu busy :           2   $    12268.50 $
   load_balance() found balanced on cpu busy :           2   $    12268.50 $
   load_balance() move task failed on cpu busy :           0   $        0.00 $
   imbalance sum on cpu busy :           0
   pull_task() count on cpu busy :           0
   pull_task() when target task was cache-hot on cpu busy :           0
   load_balance() failed to find busier queue on cpu busy :           0   $        0.00 $
   load_balance() failed to find busier group on cpu busy :           1   $    24537.00 $
   *load_balance() success count on cpu busy :           0
   *avg task pulled per successful lb attempt (cpu busy) :        0.00
   ---------------------------------------- <Category newidle> ----------------------------------------
   load_balance() count on cpu newly idle :         427   $       57.46 $
   load_balance() found balanced on cpu newly idle :         382   $       64.23 $
   load_balance() move task failed on cpu newly idle :          45   $      545.27 $
   imbalance sum on cpu newly idle :          48
   pull_task() count on cpu newly idle :           0
   pull_task() when target task was cache-hot on cpu newly idle :           0
   load_balance() failed to find busier queue on cpu newly idle :           0   $        0.00 $
   load_balance() failed to find busier group on cpu newly idle :         382   $       64.23 $
   *load_balance() success count on cpu newly idle :           0
   *avg task pulled per successful lb attempt (cpu newly idle) :        0.00
----------------------------------------------------------------------------------------------------
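
A hedged reading of the derived metrics (the exact derivation lives in the
tool): the *load_balance() success count appears to be the number of
load_balance() calls that neither found the domain already balanced nor
failed to move a task; with the idle-category numbers above, 50 - 42 - 8 = 0,
matching the reported value.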

Consider the following line:

   load_balance() found balanced on cpu newly idle :         382    $      64.23 $

While profiling was active, the load balancer found 382 times that the load
needed to be balanced on the newly idle CPU 0. The value enclosed in $ signs
is the average number of jiffies between two such events (24537 / 382 = 64.23).

This explanation of the $ fields is quite buried. Is there a way of making it clearer with a column header in the report? I think even if it was documented in the man pages it might not be that useful.

Thank you for the suggestion. I will add a header in the report to explain what each column value represents.

There are also other jiffies fields that don't use $. Maybe if it were formatted like this it could be semi-self-documenting:

----------------------------------------------------------------------
    Time elapsed (in jiffies)               :        $  24537        $
----------------------------------------------------------------------

------------------<Category newidle> ---------------------------------
  load_balance() count on cpu newly idle    :   427  $     57.46 avg $
----------------------------------------------------------------------


Other than that:

Tested-by: James Clark <james.clark@xxxxxxxxxx>

Ack.

--
Thanks and Regards,
Swapnil