Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

From: Ze Gao
Date: Thu Aug 03 2023 - 23:19:51 EST


On Fri, Aug 4, 2023 at 10:38 AM Ze Gao <zegao2021@xxxxxxxxx> wrote:
>
> On Fri, Aug 4, 2023 at 10:21 AM Ze Gao <zegao2021@xxxxxxxxx> wrote:
> >
> > On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > >
> > > On Thu, 3 Aug 2023 04:33:48 -0400
> > > Ze Gao <zegao2021@xxxxxxxxx> wrote:
> > >
> > > > Update state char array and then remove unused and stale
> > > > macros, which are kernel internal representations whose use
> > > > is no longer encouraged.
> > > >
> > > > Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
> > > > ---
> > > > tools/perf/builtin-sched.c | 13 +------------
> > > > 1 file changed, 1 insertion(+), 12 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > > > index 9ab300b6f131..8dc8f071721c 100644
> > > > --- a/tools/perf/builtin-sched.c
> > > > +++ b/tools/perf/builtin-sched.c
> > > > @@ -92,23 +92,12 @@ struct sched_atom {
> > > > struct task_desc *wakee;
> > > > };
> > > >
> > > > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > > > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> > >
> > > Thinking about this more, this will always be wrong. Changing it just works
> > > for the kernel you made the change for, but if it is run on another kernel,
> > > it's broken again.
> >
> > Indeed. There is no easy way to maintain backward compatibility unless
> > we stop using this bizarre 'prev_state' field. Basically all its users suffer
> > from this. That's why I believe this needs a fix to alert people not to
> > use 'prev_state' anymore.
> >
> > > I actually wrote code once that basically just did a:
> > >
> > > struct trace_seq s;
> > >
> > > trace_seq_init(&s);
> > > tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
> > >
> > > then searched s.buffer for "prev_state=%s ", to find the state character.
> > >
> > > That's because the kernel should always be up to date (and why I said I
> > > needed that string in the print_fmt).
> >
> > Turning to building the state char array from the print fmt string
> > dynamically is a great idea. :)

On second thought, this is not perfect either, since it does not take
offline use of perf into account. People might run perf on a different
machine from the one where perf.data was recorded, in which case what we
read from /sys/kernel/debug/tracing/events/sched/sched_switch/format is
likely to differ from the format perf.data was recorded with.

So let's parse the state from the TEP_PRINT_INFO output of each record
instead of building the state char array and relying on 'prev_state' again.
At least this fixes all tools that have TEP_PRINT_INFO available.
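
To make the per-record idea concrete, here is a rough sketch of the string
search only, not the actual patch. The helper name prev_state_from_info() is
hypothetical, and it assumes the printed info text contains a
"prev_state=<chars>" token as in the usual sched_switch output. It does not
try to handle a task comm that itself contains "prev_state=".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an existing perf or libtraceevent function):
 * return the first state character that follows "prev_state=" in the
 * already-printed event info, or '\0' if the token is missing or empty.
 * Any suffix after the first character (e.g. the '+' in "R+") is ignored.
 */
char prev_state_from_info(const char *info)
{
	static const char key[] = "prev_state=";
	const char *p;

	if (!info)
		return '\0';

	p = strstr(info, key);
	if (!p)
		return '\0';

	p += sizeof(key) - 1;	/* skip past "prev_state=" */
	if (*p == '\0' || *p == ' ')
		return '\0';	/* token present but has no value */

	return *p;
}
```

In perf, the input would presumably be s.buffer after the
tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO) call Steven showed,
so the character always comes from the format stored with the recording.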

Thanks,
Ze



> > > As perf has a tep handle, this could be a helper function to extract the
> > > state if needed, and get rid of relying on the above character array.
> >
> > I'll figure out how to make it happen.
> >
> > BTW, my last concern is whether there is any better way to tell userspace
> > to avoid interpreting task state from 'prev_state', because otherwise the
> > same awkward thing will happen again.
>
> By userspace, I mean all tools that consume 'prev_state' but don't have the
> print fmt available, bpf tracepoint programs for example.
>
> Regards,
> Ze
>
> > Thanks,
> > Ze
> >
> > > -- Steve
> > >
> > >
> > > >
> > > > /* task state bitmask, copied from include/linux/sched.h */
> > > > #define TASK_RUNNING 0
> > > > #define TASK_INTERRUPTIBLE 1
> > > > #define TASK_UNINTERRUPTIBLE 2
> > > > -#define __TASK_STOPPED 4
> > > > -#define __TASK_TRACED 8
> > > > -/* in tsk->exit_state */
> > > > -#define EXIT_DEAD 16
> > > > -#define EXIT_ZOMBIE 32
> > > > -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> > > > -/* in tsk->state again */
> > > > -#define TASK_DEAD 64
> > > > -#define TASK_WAKEKILL 128
> > > > -#define TASK_WAKING 256
> > > > -#define TASK_PARKED 512
> > > >
> > > > enum thread_state {
> > > > THREAD_SLEEPING = 0,
> > >