Re: [PATCH v3 1/n] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Mon Jun 19 2017 - 09:08:44 EST


On 16.06.2017 1:10, Alexey Budankov wrote:
Hi,

On 15.06.2017 22:56, Mark Rutland wrote:
Hi,

On Thu, Jun 15, 2017 at 08:41:42PM +0300, Alexey Budankov wrote:
This series of patches continues v2 and addresses captured comments.

As a general note, please keep any change log stuff out of the commit
message, and move that just before the diff. It generally doesn't make
sense for that to go in the commit message.

Accepted.


Specifically this patch replaces pinned_groups and flexible_groups
lists of perf_event_context by red-black cpu indexed trees avoiding
data structures duplication and introducing possibility to iterate
event groups for a specific CPU only.

If you use --per-thread, I take it the overhead is significantly
lowered?

Please ask more.


As another general thing, please aim to write the commit message so that
it'll make sense in N years' time, in the git history. Describe the
whole problem, and the intended solution.

I had to go diving into mail archives to understand the problem you're
trying to solve, and I shouldn't have to.

For example, this commit message could be something like:

perf/core: use rb trees for pinned/flexible groups

By default, the userspace perf tool opens per-cpu task-bound events
when sampling, so for N logical events requested by the user, the tool
will open N * NR_CPUS events.

In the kernel, we mux events with a hrtimer, periodically rotating the
event list and trying to schedule each in turn. We skip events whose
cpu filter doesn't match. So when we get unlucky, we can walk N *
(NR_CPUS - 1) events pointlessly for each hrtimer invocation.

This has been observed to result in significant overhead when running
the STREAM benchmark on 272 core Xeon Phi systems.

One way to avoid this is to place our events into an rb tree sorted by
CPU filter, so that our hrtimer can skip to the current CPU's
sub-tree, and ignore everything else.

As a step towards that, this patch replaces our event lists with rb
trees.
... which hopefully makes sense now, and will in 2, 5, 10 years' time.


Accepted. Thanks.


Signed-off-by: Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx>

Have you thrown Vince's perf fuzzer at this?

If you haven't, please do. It can be found in the fuzzer directory of:

https://github.com/deater/perf_event_tests


Accepted.

I ran the test suite and it revealed no additional regressions compared to what I see on the clean kernel.

However, the fuzzer consistently reports some strange stacks that are not seen on the clean kernel, and I have no idea how the patches might be causing them.


---
include/linux/perf_event.h |  34 +++-
kernel/events/core.c       | 386 +++++++++++++++++++++++++++++++++------------
2 files changed, 315 insertions(+), 105 deletions(-)

<changelog goes here>


diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24a6358..2c1dcf1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -574,6 +574,27 @@ struct perf_event {
struct list_head sibling_list;

/*
+ * Node on the pinned or flexible tree of the event context;
+ * the node may be empty if the event is not attached to the tree
+ * directly but to the group_list of another event that is directly
+ * attached to the tree;
+ */
+ struct rb_node group_node;
+ /*
+ * List of the groups allocated for the same cpu;
+ * the list may be empty if the event is not attached to the tree
+ * directly but to the group_list of another event that is directly
+ * attached to the tree;
+ */
+ struct list_head group_list;
+ /*
+ * Entry into the group_list above;
+ * the entry is attached to the event's own group_list if the event
+ * is directly attached to the pinned or flexible tree;
+ */
+ struct list_head group_list_entry;

We already have group_entry for threading the RB tree. I don't see why
we need two new list heads.

OK. For a group leader event, its group_entry may be used instead of the new group_list_entry.
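
A rough sketch of how the additions to struct perf_event might then shrink (illustration only, not the posted patch):

	/*
	 * Node in the cpu-indexed pinned/flexible tree; empty when the
	 * group hangs off another leader's group_list instead.
	 */
	struct rb_node			group_node;
	/*
	 * Other groups allocated for the same cpu, linked here via
	 * their (reused) group_entry.
	 */
	struct list_head		group_list;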


+ /*
* We need storage to track the entries in perf_pmu_migrate_context; we
* cannot use the event_entry because of RCU and we want to keep the
* group in tact which avoids us using the other two entries.
@@ -741,8 +762,17 @@ struct perf_event_context {
struct mutex mutex;

struct list_head active_ctx_list;
- struct list_head pinned_groups;
- struct list_head flexible_groups;
+ /*
+ * RB tree for pinned groups; keeps event's group_node
+ * nodes of attached pinned groups;
+ */
+ struct rb_root pinned_tree;
+ /*
+ * RB tree for flexible groups; keeps event's group_node
+ * nodes of attached flexible groups;
+ */
+ struct rb_root flexible_tree;

We didn't need these comments before, and I don't think we need them
now.

Any reason not to stick with the existing names?

The existing {pinned,flexible}_groups names are fine regardless of
whether a list or a tree is used, and they make it clear that the elements
are the group leaders, not all events.

Yes, that makes sense. I'll just change the type of the data structures.
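
In other words, something along these lines (sketch only, keeping the existing names and dropping the comments):

	struct list_head	active_ctx_list;
	struct rb_root		pinned_groups;
	struct rb_root		flexible_groups;
	struct list_head	event_list;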


+
struct list_head event_list;
int nr_events;
int nr_active;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bc63f8d..a3531f9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1458,13 +1458,142 @@ static enum event_type_t get_event_type(struct perf_event *event)
return event_type;
}

-static struct list_head *
-ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+static struct rb_root *
+ctx_group_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
{
if (event->attr.pinned)
- return &ctx->pinned_groups;
+ return &ctx->pinned_tree;
else
- return &ctx->flexible_groups;
+ return &ctx->flexible_tree;
+}
+
+/*
+ * Insert a group into a tree using event->cpu as a key. If event->cpu node
+ * is already attached to the tree then the event is added to the attached
+ * group's group_list list.
+ */
+static void
+perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
+{
+ struct rb_node **node;
+ struct rb_node *parent;
+
+ WARN_ON_ONCE(!tree || !event);
+
+ node = &tree->rb_node;
+ parent = *node;

The first iteration of the loop handles this, so it can go.

If the tree is empty, parent will be left uninitialized, which is harmful.
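
For reference, the usual rbtree insertion idiom elsewhere in the kernel side-steps the question by initializing parent to NULL, which rb_link_node() handles for an empty tree. A sketch (the equal-cpu list handling is omitted here):

	struct rb_node **node = &tree->rb_node;
	struct rb_node *parent = NULL;

	while (*node) {
		struct perf_event *node_event =
			container_of(*node, struct perf_event, group_node);

		parent = *node;
		if (event->cpu < node_event->cpu)
			node = &parent->rb_left;
		else
			node = &parent->rb_right;
	}

	/* parent stays NULL for an empty tree, as rb_link_node() expects. */
	rb_link_node(&event->group_node, parent, node);
	rb_insert_color(&event->group_node, tree);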


+
+ while (*node) {
+ struct perf_event *node_event = container_of(*node,
+ struct perf_event, group_node);
+
+ parent = *node;
+
+ if (event->cpu < node_event->cpu) {
+ node = &parent->rb_left;
+ } else if (event->cpu > node_event->cpu) {
+ node = &parent->rb_right;
+ } else {
+ list_add_tail(&event->group_list_entry,
+ &node_event->group_list);
+ return;

Why aren't we putting all group leaders into the tree?

Because we index the tree by cpu only; if two groups share a cpu, the later one is linked into the group_list of the group that is directly attached to the tree via its group_node.


I don't think this is what Peter meant when he suggested a threaded rb
tree.

Yes, that's right. This implements cpu-indexed rb trees with a linked list attached to every tree node. Each list contains only the event groups allocated for the same cpu.


My understanding was that every group leader should be in the rb tree,
and every group leader should be in the same list that threads the whole
tree. Both the rb tree and list have to be maintained with the same
order.

Thus, you can walk the rb tree to find the first event in the sub-list
that you care about, then walk that sub-list.

Note that this means we need to shuffle the list and rb tree to rotate
events in each sub-tree.

We still rotate a list, but it only contains the groups for the appropriate cpu, and we don't touch the tree.
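
As a sketch of that (not taken from the patch; list_rotate_left() moves the first entry of the per-cpu list to its tail):

	static void perf_cpu_tree_rotate(struct rb_root *tree, int cpu)
	{
		struct rb_node *node = tree->rb_node;

		while (node) {
			struct perf_event *node_event =
				container_of(node, struct perf_event, group_node);

			if (cpu < node_event->cpu) {
				node = node->rb_left;
			} else if (cpu > node_event->cpu) {
				node = node->rb_right;
			} else {
				/* Rotate only the groups for this cpu. */
				list_rotate_left(&node_event->group_list);
				break;
			}
		}
	}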


+ }
+ }
+
+ list_add_tail(&event->group_list_entry, &event->group_list);
+
+ rb_link_node(&event->group_node, parent, node);
+ rb_insert_color(&event->group_node, tree);
+}
+
+/*
+ * Delete a group from a tree. If the group is directly attached to the tree
+ * it also detaches all groups on the group's group_list list.
+ */
+static void
+perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
+{
+ WARN_ON_ONCE(!tree || !event);
+
+ if (RB_EMPTY_NODE(&event->group_node)) {
+ list_del_init(&event->group_list_entry);
+ } else {
+ struct perf_event *list_event, *list_next;
+
+ rb_erase(&event->group_node, tree);
+ list_for_each_entry_safe(list_event, list_next,
+ &event->group_list, group_list_entry)
+ list_del_init(&list_event->group_list_entry);

At this point, all the other group leaders with the same filter are gone
from the rb tree, since we didn't insert any of them into the rb tree in
place of the leader we deleted.
+ }
+}
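
One way to keep the remaining same-cpu groups reachable would be to promote the next group in the list into the tree rather than dropping them all. An untested sketch, assuming the group_list of non-tree-attached events stays empty:

	if (RB_EMPTY_NODE(&event->group_node)) {
		/* Not in the tree itself: just unlink from the per-cpu list. */
		list_del_init(&event->group_list_entry);
	} else {
		list_del_init(&event->group_list_entry);
		if (!list_empty(&event->group_list)) {
			struct perf_event *next =
				list_first_entry(&event->group_list,
						 struct perf_event,
						 group_list_entry);

			/* Hand the remaining same-cpu groups to the new head. */
			list_splice_init(&event->group_list, &next->group_list);
			rb_replace_node(&event->group_node, &next->group_node, tree);
		} else {
			rb_erase(&event->group_node, tree);
		}
		RB_CLEAR_NODE(&event->group_node);
	}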
+
+/*
+ * Find the group_list for a given cpu key and call the provided callback for
+ * every group on the list. Iteration stops if the callback returns non-zero.
+ */
+
+typedef int (*perf_cpu_tree_callback_t)(struct perf_event *, void *);
+
+static int
+perf_cpu_tree_iterate_cpu(struct rb_root *tree, int cpu,
+ perf_cpu_tree_callback_t callback, void *data)
+{
+ int ret = 0;
+ struct rb_node *node;
+ struct perf_event *event;
+
+ WARN_ON_ONCE(!tree);
+
+ node = tree->rb_node;
+
+ while (node) {
+ struct perf_event *node_event = container_of(node,
+ struct perf_event, group_node);
+
+ if (cpu < node_event->cpu) {
+ node = node->rb_left;
+ } else if (cpu > node_event->cpu) {
+ node = node->rb_right;
+ } else {
+ list_for_each_entry(event, &node_event->group_list,
+ group_list_entry) {
+ ret = callback(event, data);
+ if (ret)
+ return ret;
+ }
+ /* Matching cpu found and iterated; stop the descent. */
+ break;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Iterate a tree and call provided callback for every group in the tree.
+ * Iteration stops if the callback returns non zero.
+ */
+static int
+perf_cpu_tree_iterate(struct rb_root *tree,
+ perf_cpu_tree_callback_t callback, void *data)
+{
+ int ret = 0;
+ struct rb_node *node;
+ struct perf_event *event;
+
+ WARN_ON_ONCE(!tree);
+
+ for (node = rb_first(tree); node; node = rb_next(node)) {
+ struct perf_event *node_event = container_of(node,
+ struct perf_event, group_node);
+
+ list_for_each_entry(event, &node_event->group_list,
+ group_list_entry) {
+ ret = callback(event, data);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
}

If you need to iterate over every event, you can use the list that
threads the whole tree.
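
With such a threading list (again the hypothetical pinned_groups_list field), the full iteration would not need the tree at all:

	list_for_each_entry(event, &ctx->pinned_groups_list, group_entry) {
		ret = callback(event, data);
		if (ret)
			return ret;
	}
	return 0;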


/*
@@ -1485,12 +1614,12 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
* perf_group_detach can, at all times, locate all siblings.
*/
if (event->group_leader == event) {
- struct list_head *list;
+ struct rb_root *tree;

event->group_caps = event->event_caps;

- list = ctx_group_list(event, ctx);
- list_add_tail(&event->group_entry, list);
+ tree = ctx_group_cpu_tree(event, ctx);
+ perf_cpu_tree_insert(tree, event);

Can we wrap this in a helper that finds the sub-tree and performs
the insert?

Ok. add_to_groups().
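
(Something like this, as a sketch of that helper:)

	static void add_to_groups(struct perf_event *event,
				  struct perf_event_context *ctx)
	{
		struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

		perf_cpu_tree_insert(tree, event);
	}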


}

list_update_cgroup_event(event, ctx, true);
@@ -1680,8 +1809,12 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)

list_del_rcu(&event->event_entry);

- if (event->group_leader == event)
- list_del_init(&event->group_entry);
+ if (event->group_leader == event) {
+ struct rb_root *tree;
+
+ tree = ctx_group_cpu_tree(event, ctx);
+ perf_cpu_tree_delete(tree, event);

... likewise?

Ok. del_from_groups().
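
(And its counterpart, sketched the same way:)

	static void del_from_groups(struct perf_event *event,
				    struct perf_event_context *ctx)
	{
		struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

		perf_cpu_tree_delete(tree, event);
	}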


Thanks,
Mark.


Thanks,
Alexey