Re: [RFC PATCH] perf_counter: Add support for pinned and exclusivecounter groups

From: Ingo Molnar
Date: Wed Jan 14 2009 - 04:58:13 EST



* Paul Mackerras <paulus@xxxxxxxxx> wrote:

> Impact: New perf_counter features
>
> A pinned counter group is one that the user wants to have on the CPU
> whenever possible, i.e. whenever the associated task is running, for a
> per-task group, or always for a per-cpu group. If the system cannot
> satisfy that, it puts the group into an error state where it is not
> scheduled any more and reads from it return EOF (i.e. 0 bytes read).
> The group can be released from error state and made readable again using
> prctl(PR_TASK_PERF_COUNTERS_ENABLE). When we have finer-grained
> enable/disable controls on counters we'll be able to reset the error
> state on individual groups.

ok, that looks like quite sensible semantics.

> An exclusive group is one that the user wants to be the only group using
> the CPU performance monitor hardware whenever it is on. The counter
> group scheduler will not schedule an exclusive group if there are
> already other groups on the CPU and will not schedule other groups onto
> the CPU if there is an exclusive group scheduled (that statement does
> not apply to groups containing only software counters, which can always
> go on and which do not prevent an exclusive group from going on). With
> an exclusive group, we will be able to let users program PMU registers
> at a low level without the concern that those settings will perturb
> other measurements.

ok, this sounds good too. The disadvantage is the reduction in utility.
Actual applications and users will decide which variant is more useful in
practice - it does not complicate the design unreasonably.

> Along the way this reorganizes things a little:
> - is_software_counter() is moved to perf_counter.h.
> - cpuctx->active_oncpu now records the number of hardware counters on
> the CPU, i.e. it now excludes software counters. Nothing was reading
> cpuctx->active_oncpu before, so this change is harmless.

(btw., the percpu allocation code seems to have bitrotten a bit - the
logic around perf_reserved_percpu looks wrong and somewhat complicated.)

> - A new cpuctx->exclusive field records whether we currently have an
> exclusive group on the CPU.
> - counter_sched_out moves higher up in perf_counter.c and gets called
> from __perf_counter_remove_from_context and __perf_counter_exit_task,
> where we used to have essentially the same code.
> - __perf_counter_sched_in now goes through the counter list twice, doing
> the pinned counters in the first loop and the non-pinned counters in
> the second loop, in order to give the pinned counters the best chance
> to be scheduled in.

hm, i guess this could be improved: by queueing pinned counters in front
of the list and unpinned counters to the tail.

> Note that only a group leader can be exclusive or pinned, and that
> attribute applies to the whole group. This avoids some awkwardness in
> some corner cases (e.g. where a group leader is closed and the other
> group members get added to the context list). If we want to relax that
> restriction later, we can, and it is easier to relax a restriction than
> to apply a new one.

yeah, agreed.

> What this doesn't handle is when a pinned counter gets inherited and
> goes into error state in the child. The sensible thing would be to put
> the parent counter into error state in __perf_counter_exit_task, but
> that might mean taking it off the PMU on some other CPU, and I was
> nervous about doing anything substantial to parent_counter. Maybe what
> we need instead of an error value for counter->state is a separate error
> flag that can be set atomically by exiting children. BTW, I think we
> need an smp_wmb after updating parent_counter->count, so that when the
> parent sees the child has exited it is guaranteed to see the updated
> ->count value.

Yeah. Such artifacts at inheritance stem from the reduction in utility
that comes from any exclusive-resource-usage scheme, and are expected.

For example, right now it works just fine to nest 'timec' in itself [there
is a reduction in statistical value if we start round-robining, but
there's still full utility]. If it used exclusive or pinned counters that
might not work.

Again, which restrictions users/developers are more willing to live with
will be shown in actual usage of these facilities.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/