Re: [RFC][PATCH] perf: Rewrite core context handling

From: Peter Zijlstra
Date: Tue Oct 16 2018 - 05:33:12 EST


On Mon, Oct 15, 2018 at 11:31:24AM -0700, Stephane Eranian wrote:

> I have always had a hard time understanding the role of all these
> structs in the generic code. This is still very confusing and very
> hard to follow.
>
> In my mind, you have per-task and per-cpu perf_events contexts. And
> for each you can have multiple PMUs, some hw some sw. Each PMU has
> its own list of events maintained in an RB tree. There are never any
> interactions between PMUs.

That is more or less how it was. We have per-PMU task or CPU contexts:


task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
     ^                                |    ^    |              ^
     `--------------------------------'    |    `--> pmu <-----'
                                           v          ^
                                      perf_event -----'


Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of events
for that PMU. The task-related perf_event_contexts have a pointer back
to that task.

Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to that
PMU, and a group of events for that PMU.

The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation, etc.

Each perf_event is then associated with its PMU and one
perf_event_context.
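
In (very) simplified struct form the above looks roughly like this; the
field and type names here are illustrative sketches, not the actual
kernel definitions:

  /* Illustrative sketch only; not the real structures. */
  struct perf_event_context {
          struct pmu                 *pmu;       /* the one PMU this context serves */
          struct rb_root             events;     /* the group of events for that PMU */
          struct task_struct         *task;      /* back-pointer; NULL for CPU contexts */
  };

  struct perf_cpu_context {                      /* one instance per PMU, per CPU */
          struct perf_event_context  ctx;        /* the CPU context proper */
          struct perf_event_context  *task_ctx;  /* task context currently on this CPU */
          struct hrtimer             hrtimer;    /* drives rotation etc. */
  };

  /* And task_struct carries a fixed-size array of context pointers,
   * one slot per context type:
   *
   *   struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
   */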

> Maybe this is how this is done or proposed by your patches, but it
> certainly is not obvious.

No, my patch rather completely wrecks the above; it reduces things to a
single task context and a single CPU context.

There were a number of problems with the above. One is that per-task
array of pointers, which limited the number of task contexts we could have.

Now, we could've easily changed that to a list and called it a day.
That is not in fact a horribly difficult patch. If you combine that with
a patch that actually freed task contexts when they go empty, that
might actually work.

But there are a number of other considerations that resulted in the
patch as presented:

- there is a bunch of per-context state that is simply duplicated
  between contexts, like for instance the time keeping. There is no
  point in tracking the time for 'n' per task/cpu contexts when in fact
  they're all the same.

- on context switch we have to iterate all these 'n' contexts and
  switch them one by one, instead of just switching one context and
  calling it a day (see the sketch after this list).

- for big.little we'd end up with 2 per-task contexts and only ever use
  1 at any one time, which increases 'n' in the above cases for no
  purpose.

- the actual per-pmu-per-context state is very small (as I think Alexey
  already implied).

- a single context simplifies a bunch of things; including the
  move_group case (we no longer have to adjust perf_event::ctx), the
  cpu-online tests and the ctx locking, and it removes a bunch of
  context lists (like active_ctx_list).
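
To make the context-switch point concrete, it is roughly the difference
between the following two shapes (a sketch of what the switch path boils
down to, not the actual patch):

  /* Old: one sched-out per task context, i.e. per PMU with task events. */
  for_each_task_context_nr(ctxn)
          perf_event_context_sched_out(task, ctxn, next);

  /* New: a single task context to switch. */
  perf_event_context_sched_out(task, next);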

So a single context is what I went with. That all results in:


task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
     ^                           |   ^  ^
     `---------------------------'   |  |
                                     |  `--> perf_event_pmu_context
                                     |           ^      ^
                                     |           |      |
                                     |    ,------'      v
                                     |    |      perf_cpu_pmu_context
                                     |    |        ^
                                     |    |        |
                                     v    v        v
                                  perf_event ---> pmu


Because while the per-pmu-per-context state is small, it does exist, and
this gives rise to perf_event_pmu_context. It tracks nr_events and
nr_active, which is used to (quickly) tell if rotation is required (it
is possible to reduce this state I think, but I've not yet gotten it
down to 0). It also tracks which events are actually active; iterating a
list is cheaper than finding them all in the RB-tree.

It also contains the task_ctx_data thing for LBR, which is a
PMU-specific extra data thingy.
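
As a rough sketch of what that intermediate structure carries (again,
the field names are illustrative, not necessarily the exact ones from
the patch):

  /* Illustrative sketch of perf_event_pmu_context. */
  struct perf_event_pmu_context {
          struct pmu                 *pmu;
          struct perf_event_context  *ctx;             /* back-pointer */

          struct list_head           pmu_ctx_entry;    /* entry in ctx->pmu_ctx_list */

          struct list_head           pinned_active;    /* active events; walking a list is */
          struct list_head           flexible_active;  /* cheaper than the full RB-tree */

          unsigned int               nr_events;        /* used to (quickly) decide */
          unsigned int               nr_active;        /* whether rotation is needed */

          void                       *task_ctx_data;   /* PMU-specific, e.g. x86 LBR */
  };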

We then also keep a list of (active) perf_event_pmu_context in
perf_event_context, such that we can quickly find which PMUs are in fact
involved with the context. This simplifies context scheduling a little.
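
Context scheduling can then walk just the PMUs that actually have events
in the context, along these lines (illustrative only; the function name
and fields follow the sketch above):

  /* Sketch: only touch the PMUs that have events in this context. */
  static void ctx_sched_in_all_pmus(struct perf_event_context *ctx)
  {
          struct perf_event_pmu_context *pmu_ctx;

          list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
                  /* schedule in pmu_ctx's pinned and flexible events */
          }
  }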

We then also need per-pmu-per-cpu state, which gives rise to
perf_cpu_pmu_context, and that mostly includes bits to drive the event
rotation, which per ABI is per PMU, but it also includes bits to do
perf_event_attr::exclusive scheduling, which is also naturally
per-pmu-per-cpu.
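
And a similarly rough sketch of that per-pmu-per-cpu side (illustrative
only):

  /* Illustrative sketch of perf_cpu_pmu_context. */
  struct perf_cpu_pmu_context {
          struct perf_event_pmu_context  epc;           /* the CPU-side pmu-context */
          struct perf_event_pmu_context  *task_epc;     /* the current task-side one, if any */

          struct hrtimer                 hrtimer;       /* rotation; per ABI this is per PMU */
          ktime_t                        hrtimer_interval;

          int                            active_oncpu;  /* perf_event_attr::exclusive */
          int                            exclusive;     /*   scheduling bits          */
  };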

And yes, the above looks more complicated, but at the same time, a bunch
of things did get simplified. Maybe once the dust settles someone can
turn this here email into a sensible comment or something ;-)

> Also the Intel LBR is not a PMU on its own. Maybe you are talking about
> the BTS in arch/x86/events/intel/bts.c.

This thing:

https://lkml.kernel.org/r/1510970046-25387-1-git-send-email-megha.dey@xxxxxxxxxxxxxxx