Re: [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Steven Rostedt
Date: Wed Apr 01 2026 - 16:20:08 EST
On Wed, 01 Apr 2026 21:31:19 +0200
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Tue, Mar 31 2026 at 14:13, Steven Rostedt wrote:
> > We are looking at the performance of futexes and require a bit more
> > information when tracing them.
> >
> > The two patches here extend the system call reading of user space to
>
> s/two/three/ :)
Ah, v1 had only two patches, and this was cut and pasted from there.
>
> I understand what you are trying to achieve, but do we really need all
> the complexity of decoding and pretty printing in the kernel?
You could say the same for most tracepoints. ;-)
>
> Isn't it sufficient to store and expose the raw data and use post
> processing to make it readable?
Yes, this is possible and would work too, as libtraceevent will be
updated to parse the raw data.
>
> I've been doing complex futex analysis for two decades with a small set
> of python scripts which translate raw text or binary trace data into
> human readable information.
>
> I agree that it's useful to have the actual timeout value and other data
> which is missing today, but that still does not require all this
> customized printing.
>
> The initial idea of having at least some information about the data
> entry (type, meaning etc.) in $event/format and use that for kernel text
> output and for user space tools to analyze a binary trace has been
> definitely the right way to go.
>
> But that now deviates because $event/format cannot carry that
> information you translate to in the kernel. It will still describe raw
> event data, no?
It still shows a bit:
name: sys_enter_futex
ID: 592
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int __syscall_nr; offset:8; size:4; signed:1;
field:u32 * uaddr; offset:16; size:8; signed:0;
field:int op; offset:24; size:8; signed:0;
field:u32 val; offset:32; size:8; signed:0;
field:const struct __kernel_timespec * utime; offset:40; size:8; signed:0;
field:u32 * uaddr2; offset:48; size:8; signed:0;
field:u32 val3; offset:56; size:8; signed:0;
field:u32 __value; offset:64; size:4; signed:0;
field:u32 __value2; offset:68; size:4; signed:0;
field:unsigned long __ts1; offset:72; size:8; signed:0;
field:unsigned long __ts2; offset:80; size:8; signed:0;
print fmt: "uaddr: 0x%lx (0x%lx) cmd=%s%s%s val: 0x%x timeout/val2: 0x%llx (%lu.%lu) uaddr2: 0x%lx (0x%lx) val3: 0x%x", REC->uaddr, REC->__value, __print_symbolic(REC->op & 0xfffffe7f, {0, "FUTEX_WAIT"} ,{1, "FUTEX_WAKE"} ,{2, "FUTEX_FD"} ,{3, "FUTEX_REQUEUE"} ,{4, "FUTEX_CMP_REQUEUE"} ,{5, "FUTEX_WAKE_OP"} ,{6, "FUTEX_LOCK_PI"} ,{7, "FUTEX_UNLOCK_PI"} ,{8, "FUTEX_TRYLOCK_PI"} ,{9, "FUTEX_WAIT_BITSET"} ,{10, "FUTEX_WAKE_BITSET"} ,{11, "FUTEX_WAIT_REQUEUE_PI"} ,{12, "FUTEX_CMP_REQUEUE_PI"} ,{13, "FUTEX_LOCK_PI2"} ), (REC->op & 128) ? "|FUTEX_PRIVATE_FLAG" : "", (REC->op & 256) ? "|FUTEX_CLOCK_REALTIME" : "", REC->val, REC->utime, REC->__ts1, REC->__ts2, REC->uaddr, REC->__value2, REC->val3
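As an illustration of how a user space tool could consume the field
descriptions in the format above, here is a minimal Python sketch. It is not
how libtraceevent does it. The names parse_fields() and decode_record() are
hypothetical, SAMPLE is a subset of the field lines shown above, and the
record is assumed to be little-endian.

```python
import re

# A subset of the field lines from the sys_enter_futex format shown above.
SAMPLE = """
field:int common_pid; offset:4; size:4; signed:1;
field:u32 * uaddr; offset:16; size:8; signed:0;
field:int op; offset:24; size:8; signed:0;
field:u32 __value; offset:64; size:4; signed:0;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_fields(format_text):
    """Return (name, offset, size, signed) tuples for each field line."""
    fields = []
    for m in FIELD_RE.finditer(format_text):
        # The field name is the last token of the C declaration.
        # Array fields (e.g. "char comm[16]") are not handled here.
        name = m.group("decl").split()[-1]
        fields.append((name, int(m.group("offset")),
                       int(m.group("size")), m.group("signed") == "1"))
    return fields


def decode_record(fields, raw, byteorder="little"):
    """Decode one raw binary event record into a dict of integers.

    Assumption: every field is a plain integer and the trace was recorded
    on a little-endian machine unless byteorder says otherwise.
    """
    return {
        name: int.from_bytes(raw[offset:offset + size], byteorder,
                             signed=signed)
        for name, offset, size, signed in fields
    }
```

With something like this, the raw values are available to a script without
any decoding having been done in the kernel.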
>
> So why not keep the well-known and working solution of identifying
> the data in the format, printing it raw, and leaving the post processing to
> user space tools in case there is a need?
>
> You actually make it harder to do development. Look at the patch series
> related to robust futexes:
>
> https://lore.kernel.org/lkml/20260330114212.927686587@xxxxxxxxxx/
>
> So your decoding:
>
> > sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
>
> fails to decode the new flag and the usage of uaddr2 unless I go and add
> it in the first place _before_ working on the code. Right now it is just
> printing op as a hex value and it just works when a new bit is added.
>
> Stick 100 lines of python into tools/tracing and be done with it. I'm
> happy to contribute to that.
Well, it would be updated for trace-cmd, not tools/tracing.
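For reference, the op decoding from the print fmt above could be done in
post processing along these lines. This is only a sketch, not the trace-cmd
or libtraceevent code; decode_futex_op() is a hypothetical name, and the
command and flag values are taken from the print fmt shown earlier. An op
value with a bit the table does not know about falls back to hex, so a newly
added flag still shows up in the output.

```python
# Command values and flag bits as they appear in the print fmt above.
FUTEX_CMDS = {
    0: "FUTEX_WAIT", 1: "FUTEX_WAKE", 2: "FUTEX_FD", 3: "FUTEX_REQUEUE",
    4: "FUTEX_CMP_REQUEUE", 5: "FUTEX_WAKE_OP", 6: "FUTEX_LOCK_PI",
    7: "FUTEX_UNLOCK_PI", 8: "FUTEX_TRYLOCK_PI", 9: "FUTEX_WAIT_BITSET",
    10: "FUTEX_WAKE_BITSET", 11: "FUTEX_WAIT_REQUEUE_PI",
    12: "FUTEX_CMP_REQUEUE_PI", 13: "FUTEX_LOCK_PI2",
}
FUTEX_FLAGS = {128: "FUTEX_PRIVATE_FLAG", 256: "FUTEX_CLOCK_REALTIME"}


def decode_futex_op(op):
    """Turn a raw futex op value into a readable string."""
    flag_mask = 0
    for bit in FUTEX_FLAGS:
        flag_mask |= bit
    # Same as "op & 0xfffffe7f" in the print fmt: strip the known flags.
    cmd = op & ~flag_mask & 0xffffffff
    # Unknown commands (or unknown bits) are printed as hex.
    parts = [FUTEX_CMDS.get(cmd, hex(cmd))]
    for bit, name in FUTEX_FLAGS.items():
        if op & bit:
            parts.append(name)
    return "|".join(parts)


print(decode_futex_op(129))
```

The value 129 is FUTEX_WAKE (1) with the private flag (128) set, which is the
combination in the sys_futex() line quoted above.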
>
> Aside of that:
>
> Putting the decoder (futex_print_syscall) into the futex code itself
> is admittedly a smart move to offload the work of keeping that up to
> date to the people who are actually working on futexes.
>
> TBH, I'm not interested in dealing with that at all. If you want this
> ftrace magic pretty printing, then stick it into kernel/trace or if
> there is a real technical reason (hint there is none) into
> kernel/futex/trace.c and take ownership of it. But please do not burden
> others with your fancy toy of the day.
v1 kept it all within the tracing subsystem, but Peter suggested that it be
closer to the syscall:
https://lore.kernel.org/all/20260304090748.GO606826@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
I'm happy to put it back and maintain it separately.
Or I can just keep the simple bits (the reading of user space) and not do
all the fancier formatting. Basically, dropping patches 2 and 3.
I've been using trace-cmd start / show for testing. But I could also move
the logic to libtraceevent, which would require using trace-cmd record
instead.
How strongly are you against the full series? Are you OK with it if it stays
within the tracing subsystem? Or would you prefer keeping just patch 1,
dropping the other patches, and doing that work in libtraceevent?
-- Steve