Re: [PATCH] sched/tracing: append prev_state to tp args instead

From: Andrii Nakryiko
Date: Tue Apr 26 2022 - 11:51:46 EST

On Tue, Apr 26, 2022 at 5:28 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Apr 22, 2022 at 11:30:12AM -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 22, 2022 at 10:22 AM Delyan Kratunov <delyank@xxxxxx> wrote:
> > >
> > > On Fri, 2022-04-22 at 13:09 +0200, Peter Zijlstra wrote:
> > > > And on the other hand; those users need to be fixed anyway, right?
> > > > Accessing prev->__state is equally broken.
> > >
> > > The users that access prev->__state would most likely have to be fixed, for sure.
> > >
> > > However, not all users access prev->__state. `offcputime` for example just takes a
> > > stack trace and associates it with the switched out task. This kind of user
> > > would continue working with the proposed patch.
> > >
> > > > If bpf wants to ride on them, it needs to suffer the pain of doing so.
> > >
> > > Sure, I'm just advocating for a fairly trivial patch to avoid some of the suffering,
> > > hopefully without being a burden to development. If that's not the case, then it's a
> > > clear no-go.
> >
> >
> > Namhyung just sent this patch set:
> >
> That has:
> + * recently task_struct->state renamed to __state so it made an incompatible
> + * change.
> git tells me:
> 2f064a59a11f ("sched: Change task_struct::state")
> is almost a year old by now. That don't qualify as recently in my book.
> That says that 'old kernels used to call this...'.
> > to add off-cpu profiling to perf.
> > It also hooks into sched_switch tracepoint.
> > Notice it deals with state->__state rename just fine.
> So I don't speak BPF much; it always takes me more time to make bpf work
> than to just hack up the kernel, which makes it hard to get motivated.
> However, it was not just a rename, state changed type too, which is why I
> did the rename, to make sure all users would get a compile fail and
> could adjust.
> If you're silently making it work by frobbing the name, you loose that.

In general, libbpf is trying to accommodate it as best as it can. It
will adjust load size automatically based on BTF information for
direct memory reads. For cases when users have to use
bpf_probe_read_kernel(), it's impossible to do this automatically, but
libbpf and BPF CO-RE provide primitives to accommodate changes to
field types. We have bpf_core_field_size() and
can read any bitfield or integer type properly and store them into

So yes, user will still have to test and validate their BPF programs
for new kernels, but they do have instruments to handle this in a
contained way in BPF code without awkward extra user-space side
feature detection.

With __state rename, it's been done, it's easy to accommodate. That
easiness of being able to deal with this change is one of the reasons
we never asked to do anything about that. It was rather mild
inconveniences for a bunch of tools and applications.

This argument reordering, as I mentioned before, is a much bigger deal
and much harder to deal with easily. As Alexei mentioned, we are
discussing extending BPF CO-RE to be able to accommodate even such
changes in pure BPF-side code, but we don't have that mechanism yet,
so if it's not too hard, we still kindly ask to consider appending a
new argument at the end.

> Specifically, task_struct::state used to be 'volatile long', while
> task_struct::__state is 'unsigned int'. As such, any user must now be
> very careful to use READ_ONCE(). I don't see that happening with just
> frobbing the name.
> Additinoally, by shrinking the field, I suppose BE systems get to keep
> the pieces?
> > But it will have a hard time without this patch
> > until we add all the extra CO-RE features to detect
> > and automatically adjust bpf progs when tracepoint
> > arguments order changed.
> Could be me, but silently making it work sounds like fail :/ There's a
> reason code changes, users need to adapt, not silently pretend stuff is
> as before.
> How will you know you need to fix your tool?

There is no denying that tracing BPF code successfully compiling
doesn't mean it's going to work correctly (like with any other code,
really). So testing is still mandatory. This is more about being able
to deal with such changes without having to maintain two separately
compiled BPF programs or doing extensive and expensive feature
detection in user-space code.

> > We will do it eventually, of course.
> > There will be additional work in llvm, libbpf, kernel, etc.
> > But for now I think it would be good to land Delyan's patch
> > to avoid unnecessary pain to all the users.
> >
> > Peter, do you mind?
> I suppose I can help out this time, but I really don't want to set a
> precedent for these things. Broken is broken.

We'd really appreciate it if you do help out this time. The intent is
not to set any kind of precedent. This ask is due to
sched_switch is a very popular target *and* this change can't be
easily detected (once detected it can be easily accommodated, but
detection is painful), so overall impact is disproportionate.

> The down-side for me is that the argument order no longer makes any
> sense.

The order changes only for tracing-related macros/functions, so
hopefully that doesn't make the rest of scheduler code illogical or