Re: [RFC 0/4] perf sched: Introduce schedstat tool

From: Ian Rogers
Date: Wed May 08 2024 - 18:21:50 EST


On Tue, May 7, 2024 at 11:05 PM Ravi Bangoria <ravi.bangoria@xxxxxxx> wrote:
>
> MOTIVATION
> ----------
>
> Existing `perf sched` is quite exhaustive and provides lot of insights
> into scheduler behavior but it quickly becomes impractical to use for
> long running or scheduler intensive workload. For ex, `perf sched record`
> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
> generates huge 56G perf.data for which perf takes ~137 mins to prepare
> and write it to disk [1].
>
> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
> and generates samples on a tracepoint hit, `perf sched schedstat record`
> takes snapshot of the /proc/schedstat file before and after the workload,
> i.e. there is zero interference on workload run. Also, it takes very
> minimal time to parse /proc/schedstat, convert it into perf samples and
> save those samples into perf.data file. Result perf.data file is much
> smaller. So, overall `perf sched schedstat record` is much more light-
> weight compare to `perf sched record`.
>
> We, internally at AMD, have been using this (a variant of this, known as
> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
> of any scheduler code changes[3][4].
>
> Please note that, this is not a replacement of perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.
>
> USAGE
> -----
>
> # perf sched schedstat record
> # perf sched schedstat report
>
> Note: Although perf schedstat tool supports workload profiling syntax
> (i.e. -- <workload> ), the recorded profile is still systemwide since
> the /proc/schedstat is a systemwide file.
>
> HOW TO INTERPRET THE REPORT
> ---------------------------
>
> The schedstat report starts with total time profiling was active in
> terms of jiffies:
>
> ----------------------------------------------------------------------------------------------------
> Time elapsed (in jiffies) : 24537
> ----------------------------------------------------------------------------------------------------
>
> Next is cpu scheduling statistics. These are simple diffs of
> /proc/schedstat cpu lines along with description. The report also
> prints % relative to base stat.
>
> In the example below, schedule() left the cpu0 idle 98.19% of the time.
> 16.54% of total try_to_wake_up() was to wakeup local cpu. And, the total
> waittime by tasks on cpu0 is 0.49% of the total runtime by tasks on the
> same cpu.
>
> ----------------------------------------------------------------------------------------------------
> cpu: cpu0
> ----------------------------------------------------------------------------------------------------
> sched_yield() count : 0
> Legacy counter can be ignored : 0
> schedule() called : 17138
> schedule() left the processor idle : 16827 ( 98.19% )
> try_to_wake_up() was called : 508
> try_to_wake_up() was called to wake up the local cpu : 84 ( 16.54% )
> total runtime by tasks on this processor (in jiffies) : 2408959243
> total waittime by tasks on this processor (in jiffies) : 11731825 ( 0.49% )
> total timeslices run on this cpu : 311
> ----------------------------------------------------------------------------------------------------
>
> Next is load balancing statistics. For each of the sched domains
> (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
> the following three categories:
>
> 1) Idle Load Balance: Load balancing performed on behalf of a long
> idling cpu by some other cpu.
> 2) Busy Load Balance: Load balancing performed when the cpu was busy.
> 3) New Idle Balance : Load balancing performed when a cpu just became
> idle.
>
> Under each of these three categories, schedstat report provides
> different load balancing statistics. Along with direct stats, the
> report also contains derived metrics prefixed with *. Example:
>
> ----------------------------------------------------------------------------------------------------
> CPU 0 DOMAIN 0
> ----------------------------------------------------------------------------------------------------
> ------------------------------------------<Category idle>-------------------------------------------
> load_balance() count on cpu idle : 50 $ 490.74 $
> load_balance() found balanced on cpu idle : 42 $ 584.21 $
> load_balance() move task failed on cpu idle : 8 $ 3067.12 $
> imbalance sum on cpu idle : 8
> pull_task() count on cpu idle : 0
> pull_task() when target task was cache-hot on cpu idle : 0
> load_balance() failed to find busier queue on cpu idle : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu idle : 42 $ 584.21 $
> *load_balance() success count on cpu idle : 0
> *avg task pulled per successful lb attempt (cpu idle) : 0.00
> ------------------------------------------<Category busy>-------------------------------------------
> load_balance() count on cpu busy : 2 $ 12268.50 $
> load_balance() found balanced on cpu busy : 2 $ 12268.50 $
> load_balance() move task failed on cpu busy : 0 $ 0.00 $
> imbalance sum on cpu busy : 0
> pull_task() count on cpu busy : 0
> pull_task() when target task was cache-hot on cpu busy : 0
> load_balance() failed to find busier queue on cpu busy : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu busy : 1 $ 24537.00 $
> *load_balance() success count on cpu busy : 0
> *avg task pulled per successful lb attempt (cpu busy) : 0.00
> -----------------------------------------<Category newidle>-----------------------------------------
> load_balance() count on cpu newly idle : 427 $ 57.46 $
> load_balance() found balanced on cpu newly idle : 382 $ 64.23 $
> load_balance() move task failed on cpu newly idle : 45 $ 545.27 $
> imbalance sum on cpu newly idle : 48
> pull_task() count on cpu newly idle : 0
> pull_task() when target task was cache-hot on cpu newly idle : 0
> load_balance() failed to find busier queue on cpu newly idle : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu newly idle : 382 $ 64.23 $
> *load_balance() success count on cpu newly idle : 0
> *avg task pulled per successful lb attempt (cpu newly idle) : 0.00
> ----------------------------------------------------------------------------------------------------
>
> Consider following line:
>
> load_balance() found balanced on cpu newly idle : 382 $ 64.23 $
>
> While profiling was active, the load-balancer found 382 times the load
> needs to be balanced on a newly idle CPU 0. Following value encapsulated
> inside $ is average jiffies between two events (24537 / 382 = 64.23).
>
> Next is active_load_balance() stats. alb did not trigger while I ran
> a workload so it's all 0s.
>
> ----------------------------------<Category active_load_balance()>----------------------------------
> active_load_balance() count : 0
> active_load_balance() move task failed : 0
> active_load_balance() successfully moved a task : 0
> ----------------------------------------------------------------------------------------------------
>
> Next are sched_balance_exec() and sched_balance_fork() stats. They are
> not used but we kept it in RFC just for legacy purpose. Unless opposed,
> we plan to remove them in next revision.
>
> Next is wakeup statistics. For every domain, the report also shows
> task-wakeup statistics. Example:
>
> --------------------------------------------<Wakeup Info>-------------------------------------------
> try_to_wake_up() awoke a task that last ran on a diff cpu : 12070
> try_to_wake_up() moved task because cache-cold on own cpu : 3324
> try_to_wake_up() started passive balancing : 0
> ----------------------------------------------------------------------------------------------------
>
> Same set of stats are reported for each cpu and each domain level.
>
>
> TODO:
> - This RFC series supports v15 layout of /proc/schedstat but v16 layout
> is already being pushed upstream. We are planning to include v16 as
> well in the next revision.
> - Currently schedstat tool provides statistics of only one run but we
> are planning to add `perf sched schedstat diff` which can compare
> the data of two different runs (possibly good and bad) and highlight
> where scheduler decisions are impacting workload performance.
> - perf sched schedstat records /proc/schedstat which is a cpu and domain
> level scheduler statistic. We are planning to add taskstat tool which
> reads task stats from procfs and generate scheduler statistic report
> at task granularity. this will probably a standalone tool, something
> like `perf sched taskstat record/report`.
> - /proc/schedstat shows cpumask in domain line to indicate a group of
> cpus that belong to the domain. Since we are not using domain<->cpumask
> data anywhere, we are not capturing it as part of perf sample. But we
> are planning to include it in the next revision.
> - We have tested the patch only on AMD machines, not on other platforms.

This is great! Is it possible to add some basic shell script testing:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
for the sake of coverage, other platform testing, etc. ?

Thanks,
Ian

> - Some of the perf related features can be added to the schedstat sub-
> command as well: for ex, pipemode, -p <pid> option to profile running
> process etc. We are not planning to add TUI support however.
> - Code changes are for RFC and thus not optimal. Some of the places
> where code changes are not required for RFC are marked as /* FIXME */.
> - Documenting details about schedstat tool in perf-sched man page will
> also be done in next revision.
> - Checkpatch warnings are ignored for now.
>
> Patches are prepared on perf-tools-next/perf-tools-next (37862d6fdced).
> Although all changes are in tools/perf/, kernel is important since it's
> using v15 of /proc/schedstat.
>
> [1] https://youtu.be/lg-9aG2ajA0?t=283
> [2] https://github.com/AMDESE/sched-scoreboard
> [3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amdcom/
> [4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amdcom/
>
> Ravi Bangoria (2):
> perf sched: Make `struct perf_sched sched` a global variable
> perf sched: Add template code for schedstat subcommand
>
> Swapnil Sapkal (2):
> perf sched schedstat: Add record and rawdump support
> perf sched schedstat: Add support for report subcommand
>
> tools/lib/perf/Documentation/libperf.txt | 2 +
> tools/lib/perf/Makefile | 2 +-
> tools/lib/perf/include/perf/event.h | 37 ++
> .../lib/perf/include/perf/schedstat-cpu-v15.h | 22 +
> .../perf/include/perf/schedstat-domain-v15.h | 121 ++++
> tools/perf/builtin-inject.c | 2 +
> tools/perf/builtin-sched.c | 558 ++++++++++++++++--
> tools/perf/util/event.c | 54 ++
> tools/perf/util/event.h | 2 +
> tools/perf/util/session.c | 44 ++
> tools/perf/util/synthetic-events.c | 170 ++++++
> tools/perf/util/synthetic-events.h | 4 +
> tools/perf/util/tool.h | 4 +-
> 13 files changed, 965 insertions(+), 57 deletions(-)
> create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
> create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h
>
> --
> 2.44.0
>