Re: [PATCH] user_events: Enable user processes to create and write to trace events

From: Beau Belgrave
Date: Wed Oct 06 2021 - 13:56:18 EST


On Thu, Oct 07, 2021 at 01:28:27AM +0900, Masami Hiramatsu wrote:
> > User processes often want to trace only when it's useful. To enable this
> > a set of pages are mapped into the user process space that indicate the
> > current state of the user events that have been registered. User
> > processes can check if their event is hooked to a trace/probe, and if it
> > is, emit the event data out via the write() syscall.
> >
> > Two files are added to tracefs to accomplish this:
> > user_events_mmap - This file is mmap'd into participating user mode
> > processes to indicate event status.
>
> It seems like the "method" is used for the file name of user_events_mmap,
> instead of what is the purpose of the file.
> What about "user_events_status" and show the informations (which program
> made events and what tracer is using the events) when you read the file?
> And when you mmap it, or when you ioctl to get mmapable fd, your program
> can directly monitor the flags.
>
This sounds good to me.

> > Psuedo code example of typical usage:
> > page_fd = open("user_events_mmap", O_RDWR);
> > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
> >
> > data_fd = open("user_events_data", O_RDWR);
> > data_id = ioctl(data_fd, DIAG_IOCSREG, "test");
>
> Hmm, if the data_id is same as the ID in events/*/*/format, it will
> be queried by libtraceevent or libftrace, isn't it?
> And also, if you can define the user-event via dynamic_event interface,
> it should be used instead of using ioctl on data channel.
>
These will not be the same ID. The data_id will be the offset within a
page to check for status. We don't want to give user mode processes
access to the entire tracefs, so we are isolating to a few files that
will get wider access.

> Oh, I thought that the each user events should not be accessible from other
> processes. And I got the reason, if you run different processes (instances)
> at the same time, those may need to share same events.
>
> In summary, I would like to suggest following interface to split
> the control, monitor and data channels.
>
> - dynamic_events : Add new 'u' command but derived from synth event (or share the parser)
> - user_event_status : monitor status by mmap, and showing human-readable status by read.
> query offset by ioctl.
Agree with the above, cool idea.

> - user_event_data : just write data (binary packet).
>
We still require the ioctl to register and cannot bet solely on
dynamic_events. We want a fast way to register events on process startup
and get the data_id back as fast as possible.

The other thing is we need ref counting to know if the event is busy.
Having the ID in the packet avoids having a fd per-event, but it also
makes ref counting process lifetime of each event quite hard.

> User written binary packet will be based on the event format, for example,
>
> echo "u:foo u64 mydata; char[] msg"
>
> /sys/kernel/debug/tracing # cat events/user_events/foo/format
> name: foo
> ID: 1234
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:u64 mydata; offset:8; size:8; signed:0;
> field:__data_loc char[] msg; offset:16; size:4; signed:1;
>
> Then, the binary should be formatted as;
>
> [u32 ID][u64 mydata][u16 msg-len][u16 msg-offs][msg-data]
>
> Then user will generate a packet like;
>
> [1234][0xdeadbeef][12][16]["Hello world"]
>
> And write it to the user_event_data file.
>
> Kernel will verify len and offs if there is a dynamic array (string),
> and record common fields, update msg-offs (because it will be shift the
> size of common fields) and record user-data.
>
We want to avoid the kernel component touching any of the user data. For
example eBPF programs will parse the data directly from definitions
provided by the user. If the kernel side changed what the user expected
from their side the eBPF program would fail to decode. We also want
predicate filtering to work as cheap as possible. I would really like to
keep offset manipulation entirely in the user space to avoid confusion
across the various tracing mechanisms and avoid probing the user data
upon each call (eBPF programs only selectively probe in data).

> I also think we need a user-mode library for the above interface
> so that user can easily bind to each application and language.
I have a few colleagues testing some ideas out about this. Should I have
them reach out to you for ideas or collaboration?

Thanks,
-Beau