Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity
From: Phil Auld
Date: Wed Sep 02 2020 - 10:09:51 EST
On Wed, Sep 02, 2020 at 12:44:42PM +0200 Dietmar Eggemann wrote:
> + Phil Auld <pauld@xxxxxxxxxx>
>
Thanks Dietmar.
> On 28/08/2020 19:26, Qais Yousef wrote:
> > On 08/28/20 19:10, Dietmar Eggemann wrote:
> >> On 28/08/2020 12:27, Qais Yousef wrote:
> >>> On 08/28/20 10:00, vincent.donnefort@xxxxxxx wrote:
> >>>> From: Vincent Donnefort <vincent.donnefort@xxxxxxx>
>
> [...]
>
> >> Can you remind me why we have all these helper functions like
> >> sched_trace_rq_cpu_capacity?
> >
> > struct rq is defined in kernel/sched/sched.h. It's not exported. Exporting
> > these helper functions was the agreement to help modules trace internal info.
> > By passing generic info you decouple the tracepoint from giving specific info
> > and allow the modules to extract all the info they need from the same
> > tracepoint. IE: if you need more than just cpu_capacity from this tracepoint,
> > you can get that without having to continuously add extra arguments every time
> > you need an extra piece of info. Unless this info is not in the rq of course.
>
> I think this decoupling is not necessary. The natural place for those
> scheduler trace_event based on trace_points extension files is
> kernel/sched/ and here the internal sched.h can just be included.
>
> If someone really wants to build this as an out-of-tree module there is
> an easy way to make kernel/sched/sched.h visible.
>
It's not so much that we really _want_ to do this in an external module.
But we aren't adding more trace events and my (limited) knowledge of
BPF led me to the conclusion that its raw tracepoint functionality
requires full events. I didn't see any other way to do it.
We could put sched_tp in the tree under a debug CONFIG :)
> CFLAGS_sched_tp.o := -I$(KERNEL_SRC)/kernel/sched
>
> all:
> 	make -C $(KERNEL_SRC) M=$(PWD) modules
>
> This allowed me to build our trace_event extension module (sched_tp.c,
> sched_events.h) out-of-tree and I was able to get rid of all the
> sched_trace_foo() functions (in fair.c, include/linux/sched.h) and code
> their content directly in foo.c
>
> There are two things we would need exported from the kernel:
>
> (1) cfs_rq_tg_path() to print the path of a taskgroup cfs_rq or se.
>
> (2) sched_uclamp_used so uclamp_rq_util_with() can be used in
> sched_events.h.
>
> I put Phil Auld on cc because of his trace_point
> sched_update_nr_running_tp. I think Phil was using sched_tp as a base so
> I can't see an issue why we can't also remove sched_trace_rq_nr_running().
>
Our Perf team is now actively using this in downstream, using sched_tp, and
finding it very useful.
But I have no problem if this is all simpler in the kernel tree.
> >> In case we would let the extra code (which transforms trace points into
> >> trace events) know the internals of struct rq we could handle those
> >> things in the TRACE_EVENT and/or the register_trace_##name(void
> >> (*probe)(data_proto), void *data) thing.
> >> We always said when the internal things will change this extra code will
> >> break. So that's not an issue.
> >
> > The problem is that you need to export struct rq in a public header. Which we
> > don't want to do. I have been trying to find out how to use BTF so we can
> > remove these functions. Haven't gotten far yet - but it should be doable
> > and it's a question of me finding enough time to understand what was currently
> > done and if I can re-use something or need to come up with extra infrastructure
> > first.
>
> Let's keep the footprint of these trace points as small as possible in
> the scheduler code.
>
> I'm putting the changes I described above in our monthly EAS integration
> right now and once this works out nicely I will share the patches on lkml.
>
Sounds good, thanks!
Cheers,
Phil