Re: [PATCH v5 00/10] perf sched: Introduce stats tool

From: Ian Rogers

Date: Wed Jan 21 2026 - 12:12:33 EST


On Wed, Jan 21, 2026 at 8:33 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jan 22, 2026 at 12:09:25AM +0800, Chen, Yu C wrote:
> > On 1/20/2026 1:58 AM, Swapnil Sapkal wrote:
> > > MOTIVATION
> > > ----------
> > >
> > > Existing `perf sched` is quite exhaustive and provides a lot of insights
> > > into scheduler behavior, but it quickly becomes impractical to use for
> > > long-running or scheduler-intensive workloads. For example,
> > > `perf sched record` has ~7.77% overhead on hackbench (with 25 groups each
> > > running 700K loops on a 2-socket, 128-core, 256-thread 3rd Generation
> > > EPYC server), and it generates a huge 56G perf.data, which perf takes
> > > ~137 mins to prepare and write to disk [1].
> > >
> > > Unlike `perf sched record`, which hooks onto a set of scheduler
> > > tracepoints and generates samples on each tracepoint hit,
> > > `perf sched stats record` takes a snapshot of the /proc/schedstat file
> > > before and after the workload, i.e. there is almost zero interference
> > > with the workload run. Also, it takes very little time to parse
> > > /proc/schedstat, convert it into perf samples and save those samples
> > > into the perf.data file. The resulting perf.data file is much smaller.
> > > So, overall, `perf sched stats record` is much more lightweight
> > > compared to `perf sched record`.
> > >
> > > We, internally at AMD, have been using this (a variant of this, known as
> > > "sched-scoreboard"[2]) and found it to be very useful for analysing the
> > > impact of any scheduler code changes[3][4]. Prateek used v2[5] of this
> > > patch series to report the analysis[6][7].
> > >
> > > Please note that this is not a replacement for perf sched record/report.
> > > The intended users of the new tool are scheduler developers, not regular
> > > users.
> > >
> > > USAGE
> > > -----
> > >
> > > # perf sched stats record
> > > # perf sched stats report
> > > # perf sched stats diff
> > >
> > > Note: Although the `perf sched stats` tool supports workload profiling
> > > syntax (i.e. -- <workload> ), the recorded profile is still systemwide
> > > since /proc/schedstat is a systemwide file.
> > >
> >
> > I found this useful for load-balance analysis on my
> > 384-CPU system with 6.19.0-rc1; please feel free to add
> >
> > Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>
> Yeah, I've used a previous version for a while, was very nice.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

Acked-by: Ian Rogers <irogers@xxxxxxxxxx>

I'm still wondering if we can make some of the /proc/schedstat data
appear as tool events similar to proposals for networking and memory
tool events in:
https://lore.kernel.org/lkml/20260104011738.475680-1-irogers@xxxxxxxxxx/

Thanks,
Ian