Re: [RFC 1/6] perf/core: create active and inactive event groups

From: Mark Rutland
Date: Thu Jan 12 2017 - 06:06:13 EST


On Tue, Jan 10, 2017 at 12:45:31PM -0800, David Carrillo-Cisneros wrote:
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index 4741ecdb9817..3fa18f05c9b0 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -573,6 +573,7 @@ struct perf_event {
> >>
> >> struct hlist_node hlist_entry;
> >> struct list_head active_entry;
> >> + struct list_head ctx_active_entry;
> >
> > I think we should be able to kill off active_entry as part of this
> > series; it's there to do the same thing (optimize iteration over active
> > events).
> >
> > If we expose a for_each_ctx_active_event() helper which iterates over the
> > pinned and flexible lists, I think we may be able to migrate existing
> > users over and kill off perf_event::active_entry, and the redundant list
> > manipulation in drivers.
>
> The problem with that would be iterating over all CPU contexts, when most
> users of active_entry only install events on one CPU per package/socket.
>
> Maybe we can create yet another list of CPU contexts, containing only the
> contexts with at least one active event.

Ah. I thought these were already per-context rather than per-{pmu,box}.
My bad. I guess having both isn't really a problem.
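
For reference, the helper I had in mind was something like the below: a
minimal sketch assuming the per-context active list heads this series
adds (the list head names here are made up for illustration):

static void for_each_ctx_active_event(struct perf_event_context *ctx,
				      void (*fn)(struct perf_event *event))
{
	struct perf_event *event;

	/* walk the active pinned groups, then the active flexible groups */
	list_for_each_entry(event, &ctx->active_pinned_groups,
			    ctx_active_entry)
		fn(event);
	list_for_each_entry(event, &ctx->active_flexible_groups,
			    ctx_active_entry)
		fn(event);
}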

In all cases they're used by an hrtimer that iterates over events to
update them (and they're only used by uncore PMUs).

We could instead register an hrtimer affine to a CPU when the first event
is created there, which would iterate over that CPU's events alone. When
you only have events on one CPU, that behaves the same; when you have
events on multiple CPUs, you avoid repeated cross-calls to update them.
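
Roughly the below; a sketch only, with made-up names and an arbitrary
500ms interval. Note that hrtimer_start() would have to run on the
target CPU, since the _PINNED mode keeps the timer on the CPU it was
started on:

static enum hrtimer_restart box_hrtimer_fn(struct hrtimer *timer)
{
	/* walk this CPU's active events and update them locally ... */
	hrtimer_forward_now(timer, ms_to_ktime(500));
	return HRTIMER_RESTART;
}

/* called on the target CPU when its first event is created */
static void box_start_hrtimer(struct hrtimer *timer)
{
	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
	timer->function = box_hrtimer_fn;
	hrtimer_start(timer, ms_to_ktime(500), HRTIMER_MODE_REL_PINNED);
}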

> >> @@ -1851,6 +1877,11 @@ group_sched_out(struct perf_event *group_event,
> >>
> >> if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> >> cpuctx->exclusive = 0;
> >> +
> >> + if (group_event->state <= PERF_EVENT_STATE_INACTIVE)
> >> + ctx_sched_groups_to_inactive(group_event, ctx);
> >
> > Was this intended to be '==' ?
>
> It's '<=' so that the event is removed from the rb-tree even if it
> went to ERROR state between the last group_sched_in and this
> group_sched_out.

Ok.
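
(For reference, the state ordering from include/linux/perf_event.h, with
the lower exit/dead states elided:

enum perf_event_active_state {
	/* ... */
	PERF_EVENT_STATE_ERROR		= -2,
	PERF_EVENT_STATE_OFF		= -1,
	PERF_EVENT_STATE_INACTIVE	=  0,
	PERF_EVENT_STATE_ACTIVE		=  1,
};

... so '<=' also matches ERROR and OFF, where '==' would only match
INACTIVE.)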

> > As-is, this looks inconsistent with the WARN_ON() in
> > ctx_sched_groups_to_inactive() ...
>
> Yes, that WARN_ON is likely wrong ...
>
> >> + if (group_event->state < PERF_EVENT_STATE_INACTIVE)
> >> + ctx_sched_groups_del(group_event, ctx);
> >
> > ... and here we'll subsequently delete most events from the inactive
> > list, rather than never adding them to the inactive list in the first
> > place.
>
> Yeah, that's not right. I'll review this.

Thanks!

> >> @@ -2091,9 +2135,7 @@ group_sched_in(struct perf_event *group_event,
> >> u64 now = ctx->time;
> >> bool simulate = false;
> >>
> >> - if (group_event->state == PERF_EVENT_STATE_OFF)
> >> - return 0;
> >> -
> >> + WARN_ON(group_event->state != PERF_EVENT_STATE_INACTIVE);
> >> pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
> >>
> >> if (event_sched_in(group_event, cpuctx, ctx)) {
> >> @@ -2112,9 +2154,10 @@ group_sched_in(struct perf_event *group_event,
> >> }
> >> }
> >>
> >> - if (!pmu->commit_txn(pmu))
> >> + if (!pmu->commit_txn(pmu)) {
> >> + ctx_sched_groups_to_active(group_event, ctx);
> >> return 0;
> >
> > I think IRQs are disabled in this path (though I'll need to
> > double-check), but I don't think the PMU is disabled, so I believe a PMI
> > can come in between the commit_txn() and the addition of events to their
> > active list.
> >
> > I'm not immediately sure if that matters -- we'll need to consider what
> > list manipulation might happen in a PMI handler.
> >
> > If it does matter, we could always add the events to an active list
> > first, then try the commit, then remove them if the commit failed. It
> > means we might see some not-actually-active events in the active lists
> > occasionally, but the lists would still be shorter than the full event
> > list.
>
> Just checked, the PMU is disabled from perf_event_context_sched_in(),
> so it should be fine.

Unfortunately, that's not sufficient in all cases. Details below for the
'enjoyment' of all...

In some cases, several PMUs share the same task context, so events in a
context might be for a HW PMU other than ctx->pmu. Thus, the
perf_pmu_disable(ctx->pmu) in perf_event_context_sched_in() isn't
necessarily sufficient to disable the PMU that group_sched_in() is
operating on. AFAICT, the PMU is only necessarily disabled in the block
from start_txn() to {commit,cancel}_txn().

The reason for this is that each task has at most two contexts: a
perf_sw_context and a perf_hw_context. Despite this, more than two HW
PMUs may have task-bound events, so they have to share.

For example, in systems with heterogeneous CPUs (like ARM's big.LITTLE),
we have a PMU per microarchitecture present in the system, and a user
may wish to open events for either microarchitecture. Those share the
same perf_hw_context.

For more fun, some HW tracing PMUs use perf_sw_context, which is very
counter-intuitive.
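
The sharing falls out of pmu::task_ctx_nr: any PMUs registered with the
same context index share one per-task context. Illustratively (the field
and index are real, the PMU names are placeholders):

static struct pmu big_cpu_pmu = {
	.task_ctx_nr	= perf_hw_context,
	/* ... */
};

static struct pmu little_cpu_pmu = {
	.task_ctx_nr	= perf_hw_context,	/* same context as big_cpu_pmu */
	/* ... */
};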

If any of those PMUs sharing a context can raise an NMI, then I believe
the problem I raised above stands. Currently the ARM PMUs don't, but
there's ongoing work to implement NMIs for GICv3. I'm not sure about the
other PMUs.
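
If it does turn out to matter, the reordering I suggested above would
look roughly like the below in group_sched_in(), reusing this series'
helpers (a sketch, untested):

	/* optimistically add the group to the active list ... */
	ctx_sched_groups_to_active(group_event, ctx);

	if (!pmu->commit_txn(pmu))
		return 0;

	/*
	 * ... and take it back out again when the commit fails. A PMI
	 * in between sees at worst a few not-actually-active events on
	 * the active list, which is still shorter than the full event
	 * list.
	 */
	ctx_sched_groups_to_inactive(group_event, ctx);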

This context sharing is problematic for a number of reasons outside of
scheduling. I had hoped to address this by reworking the way we handle
task contexts to avoid sharing, but it turns out that there are many
subtle locking and ordering issues to consider there, so I haven't made
much headway with that approach.

Thanks,
Mark.