Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
From: Alexey Budankov
Date: Wed Aug 23 2017 - 04:54:20 EST
On 22.08.2017 23:21, Peter Zijlstra wrote:
> On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
>> Hi,
>
> Please don't post new versions in reply to old versions, that gets them
> lost in thread sorted views.
Accepted. Actually I followed these recommendations:
https://git-scm.com/docs/git-send-email
and they say it needs to be like this:
[PATCH 0/2] Here is what I did...
[PATCH 1/2] Clean up and tests
[PATCH 2/2] Implementation
[PATCH v2 0/3] Here is a reroll
[PATCH v2 1/3] Clean up
[PATCH v2 2/3] New tests
[PATCH v2 3/3] Implementation
So I made v7 a descendant of v6.
Do you prefer that a new version be at the same level as the
previous versions, like this?
[PATCH 0/2] Here is what I did...
[PATCH 1/2] Clean up and tests
[PATCH 2/2] Implementation
[PATCH v2 0/3] Here is a reroll
[PATCH v2 1/3] Clean up
[PATCH v2 2/3] New tests
[PATCH v2 3/3] Implementation
>
>> This patch set v7 moves event groups into rb trees and implements
>> skipping to the current CPU's list on hrtimer interrupt.
>
> Does this depend on your timekeeping rework posted in that v6 thread?
This v7 includes the timekeeping rework, so it is a complete patch set
addressing your earlier concerns. The set of changes became
smaller, so we are going the right way.
I tested v7 through several nights on Xeon Phi under a fuzzer like this:
for ((i=0;i<1000;i=i+1)); do ./perf_fuzzer; done
and there were no crashes or hangs in the perf code; the machine is still alive.
> If so, I would have expected to see that as part of these patches, if
> not, I'm confused, because part of the problem was that we currently
> need to update times for events we don't want to schedule etc..
Yes, you are right. We need to update times for those events, and
we still do, but on demand, from the read() syscall. Doing so we may
skip slowly iterating over the whole bunch of events and get a
performance boost. We may skip updating the times on every timer
interrupt and do it only on a read() call and on thread context
switch out, where update_event_times() is actually called.
>
>> Events allocated for the same CPU are still kept in a linked list
>> of the event directly attached to the tree because it is unclear
>> how to implement fast iteration thru events allocated for
>> the same CPU when they are all attached to a tree employing
>> additional 64bit index as a secondary tree key.
>
> Finding the CPU subtree and rb_next() wasn't good?
I implemented the approach you had suggested (as I understood it),
tested it and got results that are drastically different from what
I am getting for the tree of lists. Specifically I did:
1. keeping all groups in the same single tree by employing a 64-bit index
additionally to CPU key;
2. implementing special _less() function and rotation by re-inserting
group with incremented index;
3. replacing the API that took a callback in the signature with a
perf_event_groups_for_each() macro;
Employing all that shrunk the total patch size; however, I am still
struggling with correctness issues.
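A minimal userspace sketch of steps 1 and 2 above (hypothetical names; a
sorted array stands in for the rb-tree, since re-sorting after an index
bump models removing and re-inserting a node):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical group key: primary sort on cpu, secondary on a
 * monotonically growing 64-bit index. */
struct group_key {
	int cpu;
	unsigned long long idx;
};

/* The _less()-style comparator: order by cpu first, then by index. */
static int group_cmp(const void *a, const void *b)
{
	const struct group_key *ga = a, *gb = b;

	if (ga->cpu != gb->cpu)
		return ga->cpu < gb->cpu ? -1 : 1;
	if (ga->idx != gb->idx)
		return ga->idx < gb->idx ? -1 : 1;
	return 0;
}

/* "Rotation" of a group within its cpu bucket: give it an index larger
 * than any handed out so far, so a re-insert (here: re-sort) moves it
 * to the end of its cpu's run. */
static void group_rotate(struct group_key *g, unsigned long long *next_idx)
{
	g->idx = (*next_idx)++;
}
```

After a rotation the just-scheduled group sorts last among the groups of
its cpu, which is what gives round-robin fairness without a separate list.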
Now I have figured out that not all indexed events are always located
under the root with the same cpu; it depends on the order of insertion.
E.g. with insertion order 01,02,03,14,15,16 we get this:
      02
     /  \
   01    14
        /  \
      03    15
              \
               16
and it is unclear how to iterate cpu==0 part of tree in this case.
Iterating a cpu-specific subtree like this:
#define for_each_group_event(event, group, cpu, pmu, field)        \
        for (event = rb_entry_safe(group_first(group, cpu, pmu),   \
                                   typeof(*event), field);         \
             event && event->cpu == cpu && event->pmu == pmu;      \
             event = rb_entry_safe(rb_next(&event->field),         \
                                   typeof(*event), field))
misses event==03 for the case above, and I guess this is where I lose
samples in my testing.
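As a userspace sanity check (a plain unbalanced BST with hypothetical
names stands in for the kernel rbtree; the shape differs from the
balanced tree above, but that is the point), an in-order walk, which is
what rb_next() performs, visits same-cpu groups contiguously and in
index order regardless of tree shape, suggesting the gap is in how
group_first() locates the leftmost (cpu, pmu) match:

```c
#include <assert.h>
#include <stddef.h>

/* Tiny unbalanced BST stand-in for the kernel rbtree. Keys encode
 * (cpu, idx), matching the 01..16 example above. */
struct node {
	int cpu, idx;
	struct node *left, *right;
};

static int key_less(const struct node *a, const struct node *b)
{
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu;
	return a->idx < b->idx;
}

static struct node *insert(struct node *root, struct node *n)
{
	if (!root)
		return n;
	if (key_less(n, root))
		root->left = insert(root->left, n);
	else
		root->right = insert(root->right, n);
	return root;
}

/* In-order walk (the traversal order rb_next() follows), appending
 * each visited node to out[]. */
static void inorder(struct node *root, struct node **out, size_t *pos)
{
	if (!root)
		return;
	inorder(root->left, out, pos);
	out[(*pos)++] = root;
	inorder(root->right, out, pos);
}
```

Whatever shape insertion produces, the walk yields 01, 02, 03 before any
cpu==1 node, so the loop above is complete as long as group_first()
returns the leftmost matching node.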