Re: [RESEND PATCH v3 1/n] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Wed Jun 21 2017 - 05:25:21 EST


On 20.06.2017 18:12, Alexey Budankov wrote:
On 20.06.2017 16:44, Mark Rutland wrote:
On Fri, Jun 16, 2017 at 02:03:58AM +0300, Alexey Budankov wrote:
perf/core: use rb trees for pinned/flexible groups

By default, the userspace perf tool opens per-cpu task-bound events
when sampling, so for N logical events requested by the user, the tool
will open N * NR_CPUS events.
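
For illustration, a minimal userspace sketch of that pattern (hypothetical
helper names, not the tool's actual code):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* N logical events x NR_CPUS CPUs -> N * NR_CPUS kernel-side events. */
static void open_task_events(struct perf_event_attr *attrs, int n_events,
			     pid_t pid, int nr_cpus, int *fds)
{
	int ev, cpu;

	for (ev = 0; ev < n_events; ev++)
		for (cpu = 0; cpu < nr_cpus; cpu++)
			fds[ev * nr_cpus + cpu] =
				sys_perf_event_open(&attrs[ev], pid,
						    cpu, -1, 0);
}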

In the kernel, we mux events with a hrtimer, periodically rotating the
flexible group list and trying to schedule each group in turn. We
skip groups whose cpu filter doesn't match. So when we get unlucky,
we can walk N * (NR_CPUS - 1) groups pointlessly for each hrtimer
invocation.
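
Roughly, the per-tick walk looks like this (a simplified paraphrase of the
current list-based scheduling, not verbatim kernel code):

static void flexible_sched_in(struct list_head *flexible_groups, int cpu)
{
	struct perf_event *group;

	/* Walks every group on every hrtimer tick ... */
	list_for_each_entry(group, flexible_groups, group_entry) {
		/*
		 * ... but for per-cpu task-bound events, N * (NR_CPUS - 1)
		 * of them fail this filter and are skipped pointlessly.
		 */
		if (group->cpu != -1 && group->cpu != cpu)
			continue;

		/* ... try to schedule the group in on this CPU ... */
	}
}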

This has been observed to result in significant overhead when running
the STREAM benchmark on 272-core Xeon Phi systems.

One way to avoid this is to place our events into an rb tree sorted by
CPU filter, so that our hrtimer can skip to the current CPU's
list and ignore everything else.
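
The hot-path lookup would then be a plain rb-tree search keyed by the CPU
filter, along these lines (an illustrative sketch only; perf_cpu_tree_find
is a made-up name):

static struct perf_event *
perf_cpu_tree_find(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;

	while (node) {
		struct perf_event *event = container_of(node,
				struct perf_event, group_node);

		if (cpu < event->cpu)
			node = node->rb_left;
		else if (cpu > event->cpu)
			node = node->rb_right;
		else
			return event; /* head of this CPU's group list */
	}

	return NULL;
}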

As a step towards that, this patch replaces event group lists with rb
trees.

Signed-off-by: Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx>
---
include/linux/perf_event.h | 18 ++-
kernel/events/core.c | 393 +++++++++++++++++++++++++++++++++------------
2 files changed, 307 insertions(+), 104 deletions(-)

Addressed Mark Rutland's comments from the previous patch version.

... then this should be v4, no?

Which comments? Could you please write a changelog in future?

Changes are:

1. changed type of pinned_groups/flexible_groups to rb_tree;
2. removed group_list_entry and reused group_entry for that purpose;
3. added add_to_groups()/del_from_groups() helper functions.


In future, please send patches as a series, with a known upper-bound
rather than N. It's really painful to find them when they're sent
separately, without a known upper bound.

Accepted.


[...]

+/*
+ * Delete a group from a tree. If the group is directly attached to the
+ * tree it also detaches all groups on the group's group_list list.
+ */
+static void
+perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
+{
+	WARN_ON_ONCE(!tree || !event);
+
+	if (RB_EMPTY_NODE(&event->group_node)) {
+		list_del_init(&event->group_entry);
+	} else {
+		struct perf_event *list_event, *list_next;
+
+		rb_erase(&event->group_node, tree);
+		list_for_each_entry_safe(list_event, list_next,
+				&event->group_list, group_entry)
+			list_del_init(&list_event->group_entry);
+	}
+}

As I commented on the last version, this means that all groups which
were (incidentally) hanging off of this one are removed, and can
no longer be reached via the tree.

Surely one of the remaining groups should be added to the tree?
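
E.g. consider this (made-up) shape:

/*
 * Suppose A is attached to the tree and B and C hang off A's group_list:
 *
 *	tree --> A --group_list--> B --> C
 *
 * perf_cpu_tree_delete(tree, A) above does:
 *
 *	rb_erase(&A->group_node, tree);    A leaves the tree
 *	list_del_init(&B->group_entry);    B is unlinked
 *	list_del_init(&C->group_entry);    C is unlinked
 *
 * B and C still exist, but nothing in the tree reaches them any more;
 * one of them should have taken A's place as the tree node.
 */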

Aww, I see. That needs to be implemented. Thanks.

Addressing that inconsistency like this:

static void
perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
{
	struct perf_event *next;

	WARN_ON_ONCE(!tree || !event);

	/* Unlink the event from whichever group_list it sits on. */
	list_del_init(&event->group_entry);

	if (!RB_EMPTY_NODE(&event->group_node)) {
		if (!list_empty(&event->group_list)) {
			/*
			 * The event was a tree node with other groups
			 * hanging off it: promote the first remaining
			 * group to take its place in the tree.
			 */
			next = list_first_entry(&event->group_list,
					struct perf_event, group_entry);
			list_replace_init(&event->group_list,
					&next->group_list);
			rb_replace_node(&event->group_node,
					&next->group_node, tree);
		} else {
			rb_erase(&event->group_node, tree);
		}
		RB_CLEAR_NODE(&event->group_node);
	}
}

This solves the list_del corruption:

list_del corruption. prev->next should be ffff88c2c4654010, but was ffff88c31eb0c020
[ 607.632813] ------------[ cut here ]------------
[ 607.632816] kernel BUG at lib/list_debug.c:53!

[ 607.635531] Call Trace:
[ 607.635583] list_del_event+0x1d7/0x210

and the x86_pmu_start warning:

[ 484.804737] WARNING: CPU: 15 PID: 31168 at arch/x86/events/core.c:1257 x86_pmu_start+0x8f/0xb0

[ 484.804938] RIP: 0010:x86_pmu_start+0x8f/0xb0
[ 484.804971] Call Trace:
[ 484.804976] <IRQ>
[ 484.804984] x86_pmu_enable+0x27f/0x2f0



I don't believe that is correct.

I believe it would be simpler to reason about a threaded rb-tree here,
since that special case wouldn't exist.
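
E.g. something like the below (an illustrative, untested sketch with
made-up names): every event gets its own tree node, plus an in-order list
threaded through the nodes, so deletion takes the same path whether or not
the event was a "head":

struct group_node {
	struct rb_node node;	/* keyed by (cpu, ...); duplicates allowed */
	struct list_head link;	/* in-order thread through all tree nodes */
};

static void group_tree_delete(struct rb_root *tree, struct group_node *gn)
{
	/* No head-vs-member special case: every node deletes the same way. */
	rb_erase(&gn->node, tree);
	list_del_init(&gn->link);
}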

Thanks,
Mark.