Re: [PATCH] user_events: Enable user processes to create and write to trace events
From: Beau Belgrave
Date: Thu Oct 07 2021 - 12:22:11 EST
On Thu, Oct 07, 2021 at 11:17:38PM +0900, Masami Hiramatsu wrote:
> > > > Psuedo code example of typical usage:
> > > > page_fd = open("user_events_mmap", O_RDWR);
> > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
> > > >
> > > > data_fd = open("user_events_data", O_RDWR);
> > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test");
> > >
> > > Hmm, if the data_id is same as the ID in events/*/*/format, it will
> > > be queried by libtraceevent or libftrace, isn't it?
> > > And also, if you can define the user-event via dynamic_event interface,
> > > it should be used instead of using ioctl on data channel.
> > >
> > These will not be the same ID. The data_id will be the offset within a
> > page to check for status. We don't want to give user mode processes
> > access to the entire tracefs, so we are isolating to a few files that
> > will get wider access.
>
> OK, but how to query the id of the event which has been made
> from writing dynamic_events file? (maybe query from name by
> ioctl?)
>
Yes, you can either do ioctl or with the v2 patch you can now just cat
user_events_status as you indicated previously to do :)
> > We still require the ioctl to register and cannot bet solely on
> > dynamic_events. We want a fast way to register events on process startup
> > and get the data_id back as fast as possible.
>
> What about using ioctl on 'user_event_status' file?
> But in this case, the name is not showing the function.
>
> Then, what about renaming it as simply 'user_events'? :)
> This file will show the current user events as similar to synth_events,
> in addition, it will show the current status in "# comment".
>
> $ echo "user1 u32 arg1; u64 arg2" > user_events
> $ cat user_events
> user1 u32 arg1; u64 arg2
> $ echo 1 > events/user_events/user1/enable
> $ cat user_events
> user1 u32 arg1; u64 arg2 # Used by ftrace
>
> In addition,
> - user-process can do ioctl() to query offset instead of id from name.
> - user-process can do mmap() the file to monitor the status.
>
Please see v2 patch, I do this pattern except it's '(Used by ftrace)'
instead of '# Used by ftrace'.
Format is id:name status
> > The other thing is we need ref counting to know if the event is busy.
> > Having the ID in the packet avoids having a fd per-event, but it also
> > makes ref counting process lifetime of each event quite hard.
>
> Hmm, I don't think so. You can use an array of the pointer to
> events on the private data of the struct file.
> When you add (or start using) an event (this is identified by ioctl),
> you can increment the event refcount and add it to the array.
> When the file is closed (in exiting process), it will loop on the
> array and decrement the refcount for each event.
> Then, after all tracers disabled the event, ftrace can remove the
> event in background (unless it is defined through 'dynamic_events' or
> 'user_events').
>
Yes, I didn't say it's impossible :) It's quite hard and takes a lot
more management. I don't see a clear benefit to that approach, why is it
better than an fd lifetime? Not trying to be difficult, just trying to
be pragmatic about what approach is best.
> > We want to avoid the kernel component touching any of the user data. For
> > example eBPF programs will parse the data directly from definitions
> > provided by the user. If the kernel side changed what the user expected
> > from their side the eBPF program would fail to decode.
>
> Sorry, that's not doable unless we introduce a new __rel_loc__ attribute
> to the ftrace (and libtraceevent). With that, the dynamic data
> (like string) offset can be written as the offset from itself.
>
With this patch I have no problem getting hist, trigger and filter to
work since the user is providing all correct offsets. Can you help me
understand why this won't work? (It appears to work well without the
kernel modifying user provided memory).
> > > [1234][0xdeadbeef][12][0]["Hello world"]
>
> Anyway, as far as you are using trace event the data format itself is
> the contract between kernel and user, because *kernel* also has to
> decode it (e.g. user will read the 'trace' file.) and other tools
> like perf has to decode it too.
>
> The kernel will not modify the data (again with __rel_loc__ attr) but
> it will add the common_* field and event ID as a header to its buffer
> when record it.
>
Adding the common fields is fine, and works well with the current patch.
I've been able to decode in ftrace, perf and eBPF without any issues for
common types. __data_loc needs user mode cooperation, but no other type
does. I've also tested hist, filter and trigger and all appear to work
as I understand them to.
> > We also want
> > predicate filtering to work as cheap as possible. I would really like to
> > keep offset manipulation entirely in the user space to avoid confusion
> > across the various tracing mechanisms and avoid probing the user data
> > upon each call (eBPF programs only selectively probe in data).
>
> OK, so let's add __rel_loc__ attribute. The rel_loc type will be
>
> struct rel_loc {
> uint16_t len; /* The data size (including '\0' if string )*/
> uint16_t offs; /* The offset of actual data from this field */
> } __packed;
>
> Hmm, btw, this will be good for probe events... I don't need to pass
> the base address with this attribute.
>
What's the difference between __rel_loc__ and __data_loc? Seems like
instead of just offset it's length + offset?
Thanks,
-Beau