Re: [BUG] perf report: ordered events and flushing bug

From: Stephane Eranian
Date: Thu Mar 12 2015 - 15:23:13 EST


On Thu, Mar 12, 2015 at 3:13 PM, David Ahern <dsahern@xxxxxxxxx> wrote:
> On 3/12/15 1:05 PM, Stephane Eranian wrote:
>>
>> On Thu, Mar 12, 2015 at 5:02 AM, Adrian Hunter <adrian.hunter@xxxxxxxxx>
>> wrote:
>>>
>>> On 12/03/15 05:32, Stephane Eranian wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am working on the JIT support to improve the flow and have
>>>> perf record inject the MMAPs at the end of the collection. For
>>>> that I piggyback on the buildid pass. To avoid rewriting the entire
>>>> perf.data
>>>> file, I simply append the the MMAP records at the end of the file. And
>>>> that
>>>> puts them out-of-order in time systematically. But I thought it would be
>>>> okay
>>>> because perf report would sort them by timestamps again.
>>>>
>>>> Well, it does not in all cases! Why?
>>>>
>>>> Because of the round flushing. Based on how far out-of-order an event
>>>> is,
>>>> it may not be processed correctly because of round flushing. I believe
>>>> this
>>>> may only impact MMAP records. This is a serious issue because
>>>> mmaps drive symbolization of samples. If samples are processed without
>>>> the
>>>> proper dso mapping, then samples may not be symbolized or may be wrongly
>>>> symbolized.
>>>>
>>>> So far, the workaround I found was to set the oe->next_flush = 0 for
>>>> the ROUND mode.
>>>> In other words, do not flush anything until FINAL. To me, this is the
>>>> only sensible
>>>> way of avoiding this kind of problems. I am not sure I understand the
>>>> point of flushing
>>>> anyway, except to minimize memory footprint, maybe. But it does not
>>>> work with vastly
>>>> out-of-order mmaps.
>
>
> Rounds and flushing after them helps with the user experience -- at least
> for some commands. On systems with 1024 cpus perf data files get large
> quickly and the resulting analysis command can appear to hang for long
> periods (e.g., i have done 1 second data collections that generate 5GB files
> and take 30-45 minutes before perf shows any output).
>
> I have a patch to perf-record to allow a user to specify the poll timeout.
> That helps define the maximum time between 'rounds' on the record side and
> then commands that show each event (e.g, perf script) will show output much
> more quickly without the need to consume large amounts of memory or process
> the entire file before displaying events.
>
I can understand this but not at the cost of correctness. You have no guarantee
that events will appear in the perf.data with a fixed maximum time window, i.e.,
beyond the window, out-of-order events may be discarded.
There ought to be a better way of keeping the user aware of what's going on
with the progress bar.


>>>>
>>>> Do you have a better solution?
>>>
>>>
>>> You could hook the ordered event delivery (see Arnaldo's perf/core
>>> branch):
>>>
>>> if (tool->ordered_events) {
>>> inject->deliver = session->ordered_events.deliver;
>>> session->ordered_events.deliver = jit_mmap_deliver;
>>> }
>>>
>>> int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event
>>> *event, struct perf_sample *sample)
>>> {
>>> struct perf_inject *inject = container_of(oe->tool, struct
>>> perf_inject, tool);
>>>
>>> if (next_jit_mmap_time < sample->time) {
>>> <synth mmap event>
>>> perf_session__deliver_synth_event(...);
>>> }
>>> return inject->deliver(oe, event, sample);
>>> }
>>>
>> I don't quite understand what this does?
>> Are you saying that in perf/core there is a new callback mechanism to
>> override the deliver phase
>> when you need ordered event and that callback could be used to avoid
>> the flushing?
>>
>
> I think the suggestion is to push the mmap events into the stream at the
> appropriate time slot.
>
> David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/