Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Mon May 29 2017 - 06:46:47 EST


On 29.05.2017 13:33, Peter Zijlstra wrote:
On Mon, May 29, 2017 at 12:24:53PM +0300, Alexey Budankov wrote:
On 29.05.2017 10:45, Peter Zijlstra wrote:
On Sat, May 27, 2017 at 02:19:51PM +0300, Alexey Budankov wrote:
Solution:

cpu indexed trees for perf_event_context::pinned_groups and
perf_event_context::flexible_groups lists are introduced. Every tree node
keeps a list of groups allocated for the same cpu. A tree references only
groups located at the appropriate group list. The tree provides capability
to iterate over groups allocated for a specific cpu only, what is exactly
required by multiplexing timer interrupt handler. The handler runs per-cpu
and enables/disables groups using group_sched_in()/group_sched_out() that
call event_filter_match() function filtering out groups allocated for cpus
different from the one executing the handler. Additionally for every
filtered out group group_sched_out() updates tstamps values to the current
interrupt time. This updating work is now done only once by
update_context_time() called by ctx_sched_out() before cpu groups
iteration. For this trick to work it is required that tstamps of filtered
out events would point to perf_event_context::tstamp_data object instead
of perf_event::tstamp_data ones, as it is initialized from an event
allocation. tstamps references are switched by
group_sched_in()/group_sched_out() every time a group is checked for its
suitability for currently running cpu. When a thread enters some cpu on
a context switch a long run through pinned and flexible groups is
performed by perf_event_sched_in(, mux=0) with new parameter mux set to 0
and filtered out groups tstamps are switched to
perf_event_context::tstamp_data object. Then a series of multiplexing
interrupts happens and the handler rotates the flexible groups calling
ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the cpu
tree lists only and avoiding long runs through the complete group lists.
This is where speedup comes from. Eventually when the thread leaves the cpu
ctx_sched_out(,mux=0) is called restoring tstamps pointers to the events'
perf_event::tstamp_data objects.

This is unreadable.. Please use whitespace.

Do you mean do NOT use whitespaces? Could you explain in more detail what
you mean?

No, add _more_ whitespace. Use things like paragraphs and such. Reading
a massive blob of text like that is painful. Also use full and complete
sentences. For example, the very first sentence:

'cpu indexed trees for perf_event_context::pinned_groups and
perf_event_context::flexible_groups lists are introduced.'

feels incomplete and leaves one wondering what for etc.. And it only
gets worse.


Makes sense. Will do. Thanks.


Yeah, this doesn't go into a changelog. Have you _ever_ seen a changelog
with such crud in?


I have not. Could you please advise how to format this information to be
suitable for changelog, if it's required at all?

Don't include it. It should be fairly obvious from the diff itself what
changed after all.


Ok. Clear.