Re: [PATCH v4 00/11] perf sched: Introduce stats tool

From: Ian Rogers

Date: Fri Dec 12 2025 - 00:11:36 EST

On Thu, Dec 11, 2025 at 7:43 PM Ravi Bangoria <ravi.bangoria@xxxxxxx> wrote:
>
> Hi Ian,
>
> >>> Next is CPU scheduling statistics. These are simple diffs of
> >>> /proc/schedstat CPU lines along with description. The report also
> >>> prints % relative to base stat.
> >
> > I wonder if this is similar to user_time and system_time:
> > ```
> > $ perf list
> > ...
> > tool:
> > ...
> > system_time
> > [System/kernel time in nanoseconds. Unit: tool]
> > ...
> > user_time
> > [User (non-kernel) time in nanoseconds. Unit: tool]
> > ...
> > ```
> > These events are implemented by reading /proc/stat and /proc/pid/stat:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/tool_pmu.c?h=perf-tools-next#n267
> >
> > As they are events then they can appear in perf stat output and also
> > within metrics.
>
> Create synthesized events for each field of /proc/schedstat?
>
> Your idea is interesting and, I suppose, will work best when we care
> about individual counters. However, for the "perf sched stats" tool,
> I see at least two challenges:
>
> 1. One of the design goals of "perf sched stats" was to keep the
> overhead low. Currently, it reads /proc/schedstat once at the
> beginning and once at the end. Switching to per-counter events
> would require opening, reading and closing a large number of
> events which would incur significant overhead.
>
> 2. Taking a snapshot in one go allows us to correlate counts easily.
> Using synthetic events would force us to read each counter
> individually, making cross-counter correlation impossible.

Thanks Ravi, those are interesting problems. There are similar
problems with just reading regular counters. For example, with the
problem in this series:
https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@xxxxxxxxxx/
which has since been reduced to what remains here:
https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@xxxxxxxxxx/
we could do a better bandwidth calculation if duration_time were read
along with the uncore counters. Perhaps we could have, say, a "wall-clock"
software counter (i.e., like cpu-clock and task-clock) to allow that, and
allow the group of events to be read in one go, as optimized here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evsel.c?h=perf-tools-next#n1910
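
As a rough illustration of why reading the time alongside the counters
helps, here is a minimal sketch that decodes a group read buffer and
derives a bandwidth from two such reads. The buffer layout is the
PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING one from perf_event_open(2). Everything
else is an assumption for illustration: the "wall-clock" event does not
exist today, its position as the first group member is arbitrary, and 64
bytes per count is just a placeholder scale.

```python
import struct


def parse_group_read(buf):
    """Decode a read() buffer for a group opened with PERF_FORMAT_GROUP,
    PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING
    (no PERF_FORMAT_ID): u64 nr, u64 time_enabled, u64 time_running,
    followed by nr u64 values."""
    nr, time_enabled, time_running = struct.unpack_from("=QQQ", buf, 0)
    values = list(struct.unpack_from(f"={nr}Q", buf, 24))
    return time_enabled, time_running, values


def bandwidth_bytes_per_sec(before, after, bytes_per_count=64):
    """before/after are value lists from two group reads.
    Index 0: hypothetical wall-clock counter in nanoseconds (assumption).
    Index 1: an uncore transfer counter (assumption).
    bytes_per_count is a placeholder scale, not taken from any real PMU."""
    elapsed_ns = after[0] - before[0]
    if elapsed_ns <= 0:
        raise ValueError("wall-clock counter did not advance")
    return (after[1] - before[1]) * bytes_per_count * 1e9 / elapsed_ns


if __name__ == "__main__":
    # Synthetic buffers standing in for two reads of the group leader fd.
    first = struct.pack("=QQQ2Q", 2, 500, 500, 1_000_000_000, 100)
    second = struct.pack("=QQQ2Q", 2, 900, 900, 3_000_000_000, 1_100)
    _, _, v0 = parse_group_read(first)
    _, _, v1 = parse_group_read(second)
    print(bandwidth_bytes_per_sec(v0, v1))
```

Because both values come from the same read, the elapsed time and the
count delta cover the same interval, which is not guaranteed when
duration_time is sampled separately.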

So maybe there is potential for a read-group-style optimization of
tool-like counters, to do something similar to what you are doing here.
Anyway, that's a different set of things to do and shouldn't inhibit
trying to get this series to land.
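
For comparison, the snapshot-and-diff approach described above (read
/proc/schedstat once at the start and once at the end) can be sketched
roughly as follows. This is not the code from the series, only an
illustration: the helper names are made up, fields are kept positional
because their meaning depends on the schedstat version, and domain lines
are ignored.

```python
def read_schedstat(path="/proc/schedstat"):
    """Take one snapshot of the whole file in a single read."""
    with open(path) as f:
        return f.read()


def parse_cpu_lines(text):
    """Map 'cpuN' -> list of integer fields, skipping version,
    timestamp and domain lines."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            stats[fields[0]] = [int(x) for x in fields[1:]]
    return stats


def diff_snapshots(before, after):
    """Per-CPU, per-field difference (after - before) for CPUs present
    in both snapshots."""
    return {
        cpu: [a - b for b, a in zip(before[cpu], after[cpu])]
        for cpu in after
        if cpu in before
    }


if __name__ == "__main__":
    # Synthetic snapshots; a real run would call read_schedstat() twice
    # around the workload instead.
    start = "version 15\ntimestamp 100\ncpu0 1 2 3\ndomain0 ff 5 6\ncpu1 10 20 30\n"
    end = "version 15\ntimestamp 200\ncpu0 4 2 10\ndomain0 ff 9 9\ncpu1 10 25 31\n"
    print(diff_snapshots(parse_cpu_lines(start), parse_cpu_lines(end)))
```

All fields of a CPU line come from the same read, which is what makes
the cross-counter correlation straightforward.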

Thanks,
Ian

> Thanks,
> Ravi