Re: ftrace global trace_pipe_raw

From: Claudio
Date: Wed Dec 19 2018 - 06:42:27 EST


Hi Steven,

going back to this old theme to clarify a bit what I was trying to achieve:

On 07/24/2018 04:23 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 11:58:18 +0200
> Claudio <claudio.fontana@xxxxxxxxx> wrote:
>
>> Hello Steven,
>>
>> I am doing correlation of linux sched events, following all tasks between cpus,
>> and one thing that would be really convenient would be to have a global
>> trace_pipe_raw, in addition to the per-cpu ones, with already sorted events.

I think that I asked for the wrong thing, since I did not understand how the implementation worked.
Which lead to your response, thank you for the clarification.

>>
>> I would imagine the core functionality is already available, since trace_pipe
>> in the tracing directory already shows all events regardless of CPU, and so
>> it would be a matter of doing the same for trace_pipe_raw.
>
> The difference between trace_pipe and trace_pipe_raw is that trace_pipe
> is post processed, and reads the per CPU buffers and interleaves them
> one event at a time. The trace_pipe_raw just sends you the raw
> unprocessed data directly from the buffers, which are grouped per CPU.

I think that what I am looking for, to improve the performance of our system,
is a post processed stream of binary entry data, already merged from all CPUs
and sorted per timestamp, in the same way that it is done for textual output
in __find_next_entry:

for_each_tracing_cpu(cpu) {

if (ring_buffer_empty_cpu(buffer, cpu))
continue;

ent = peek_next_entry(iter, cpu, &ts, &lost_events);

/*
* Pick the entry with the smallest timestamp:
*/
if (ent && (!next || ts < next_ts)) {
next = ent;
next_cpu = cpu;
next_ts = ts;
next_lost = lost_events;
next_size = iter->ent_size;
}
}

We first tried to use the textual output directly, but this lead to
unacceptable overheads in parsing the text.

Please correct me if I do not understand, however it seems to me that it
would be possible do the same kind of post processing including generating
a sorted stream of entries, just avoiding the text output formatting,
and outputting the binary data of the entry directly, which would be way
more efficient to consume directly from user space correlators.

But maybe this is not a general enough requirement to be acceptable for
implementing directly into the kernel?

We have the requirement of using the OS tracing events, including
scheduling events, to react from software immediately
(vs doing after-the-fact analysis).

Thank you for your comment on this and I wish you nice holidays.

Ciao,

Claudio