Re: [PATCH 1/2] mm: page_alloc.c: Add tracepoints for slowpath

From: Steven Rostedt
Date: Fri Aug 05 2016 - 12:30:46 EST


On Fri, 5 Aug 2016 21:33:25 +0530
Janani Ravichandran <janani.rvchndrn@xxxxxxxxx> wrote:


> > Hmm, looking at the code, it appears setting tracing_thresh should
> > work. Could you show me exactly what you did?
> >
>
> Sure. I wanted to observe how long it took to allocate pages and also how
> long functions in the direct reclaim path took.
>
> So I did:
>
> echo function_graph > current_tracer
> echo __alloc_pages_nodemask > set_graph_function

Eliminate the update to set_graph_function step. I'm not sure this is
the cause, but it's not needed. Adding to set_ftrace_filter should be
sufficient.

> echo __alloc_pages_nodemask >> set_ftrace_filter

Use '>' and not '>>' as I don't know what you had in there before, as
I'm guessing you want this to only contain what you listed here.

> echo do_try_to_free_pages >> set_ftrace_filter
> echo shrink_zone >> set_ftrace_filter
> echo mem_cgroup_softlimit_reclaim >> set_ftrace_filter
> echo shrink_zone_memcg >> set_ftrace_filter
> echo shrink_slab >> set_ftrace_filter
> echo shrink_list >> set_ftrace_filter
> echo shrink_active_list >> set_ftrace_filter
> echo shrink_inactive_list >> set_ftrace_filter
>
> echo 20 > tracing_thresh

You probably want to clear the trace here, or set function_graph here
first. Because the function graph starts writing to the buffer
immediately.

>
> echo 1 > events/vmscan/mm_shrink_slab_start/enable
> echo 1 > events/vmscan/mm_shrink_slab_end/enable
> echo 1 > events/vmscan/mm_vmscan_direct_reclaim_begin/enable
> echo 1 > events/vmscan/mm_vmscan_direct_reclaim_end/enable
>
> Rik had suggested that itâd be good to write only the tracepoint info related to
> high latencies to disk. Because otherwise, thereâs a lot of information from
> the tracepoints. Filtering them out would greatly reduce disk I/O.
>
> What I first tried with begin/end tracepoints was simply use their timestamps
> to calculate duration and write the tracepoint info to disk only if it exceeded a
> certain number.
>
> The function graph output is great when
>
> a, no thresholds or tracepoints are set (with those aforementioned functions
> used as filters).
>
> Here is a sample output.
> https://github.com/Jananiravichandran/Analyzing-tracepoints/blob/master/no_tp_no_threshold.txt
>
> Lines 372 to 474 clearly show durations of functions and the output is helpful
> to observe how long each function took.
>
> b, no thresholds are set and the tracepoints are enabled to get some
> additional information.
>
> https://github.com/Jananiravichandran/Analyzing-tracepoints/blob/master/set_tp_no_threshold.txt
>
> Lines 785 to 916 here clearly show which tracepoints were invoked within which
> function calls as the beginning and end of functions are clearly marked.
>
> c, A threshold (20 in this case) is set but no tracepoints are enabled.
>
> https://github.com/Jananiravichandran/Analyzing-tracepoints/blob/master/no_tp_set_threshold.txt
>
> Lines 230 to 345 only show functions which exceeded the threshold.
>
> But thereâs a problem when a threshold is set and the tracepoints are enabled. It
> is difficult to know the subset of the total tracepoint info printed that was actually
> part of the functions that took longer than the threshold to execute (as there is no
> info indicating the beginning of functions unlike case b, mentioned above).
>
> For example,
> between lines 59 and 75 here:
>
> https://github.com/Jananiravichandran/Analyzing-tracepoints/blob/master/set_tp_set_threshold.txt

When threshold is set, the entry is not recorded, because it is only
showing the exit and the time it took in that function:

0) kswapd0-52 | + 54.141 us | } /* shrink_zone */

shrink_zone() took 54.141us.

The reason it doesn't record the entry is because it would fill the
entire buffer, if the threshold is never hit. One can't predict the
time in a function when you first enter that function.

>
> we can see that there was a call to shrink_zone() which took 54.141 us
> (greater than 20, the threshold). We also see a lot of tracepoint information
> printed between lines 59 and 74. But it is not possible for us to filter out
> only the tracepoint info that belongs to the shrink_zone() call that took 54.141
> us as it is possible that some of the information was printed as part of
> other shrink_zone() calls which took less than the threshold and therefore
> did not make it to the output file.

Exactly!

You need your own interpreter here. Perhaps a module that either reads
the tracepoints directly and registers a function graph tracer itself.
The trace events and function tracers are plugable. You don't need to
use the tracing system to use them. Just hook into them directly.

Things like the wakeup latency tracer does this. Look at
kernel/trace/trace_sched_wakeup.c for an example. It hooks into the
sched_wakeup and sched_switch tracepoints, and also has a way to use
function and function_graph tracing.



>
> So, itâs the filtering of anomalous data from tracepoints that I find difficult while
> using function_graphs.

Well, as I said, you can't filter on the entry tracepoint/function
because you don't know how long that function will take yet. You need
to have code that takes all information and only writes it out after
you hit the latency. That's going to require some custom coding.

>
> > Either way, adding your own function graph hook may be a good exercise
> > in seeing how all this works.
>
> Thank you for your suggestions regarding the function graph hook! I will try
> it and see if thereâs anything I can come up with!

Great! And note, even if you add extra tracepoints, you can hook
directly into them too. Again, see the trace_sched_wakeup.c for
examples.

-- Steve