Re: [RFC PATCH] perf_counter: Add counter enable/disable ioctls

From: Ingo Molnar
Date: Fri Jan 16 2009 - 07:17:21 EST



* Paul Mackerras <paulus@xxxxxxxxx> wrote:

> Ingo Molnar writes:
>
> > Okay - please let me know when i can pull it - the extension of counter
> > ops to the 'family' looks nice and desirable.
>
> Thanks - I'll try to do some basic testing this weekend. I won't get
> a chance to do any extensive testing for the next two weeks because
> I'll be on vacation.
>
> > I'm wondering, should it also extend to all members of a group perhaps -
> > or should that be separate ioctls?
> >
> > It makes sense to individually enable/disable a single counter within a
> > group - including the group leader. So i guess we need 3 variants:
> >
> > - disable/enable single counter
> > - disable/enable group
> > - disable/enable family
> >
> > We might as well just offer a single variant initially: 'disable/enable
> > everything that can be iterated from that counter', and provide more
> > specialized variants only once the specific need arises?
>
> Well, each member of a group has its own family. I don't see any need
> to enable/disable the top-level (parent) counter separately from its
> children, but as you say, if the need arises we can add that.
>
> We can enable/disable individual subsidiary group members (and their
> families) since we have an fd for each member. We would only need
> separate 'single counter' and 'group' ioctls for the group leader, and
> once again we could add that if the need arises.
>
> Currently the ioctls on the group leader disable/enable the whole group.
> I'm not sure that enabling/disabling individual counters in the group
> will really be all that useful, but I put it in for subsidiary counters
> since it falls out pretty easily and naturally in the code. If users
> want to disable/enable each counter in a group individually, they could
> add an extra dummy counter (say a cpu_clock counter) to be the group
> leader and then just mostly ignore it.
>
> So I think that boils down to saying that what I have implemented is the
> single variant that you propose in your last quoted paragraph above. :)

Okay :-) I missed the detail that it iterates down groups as well, if done
on the leader.

> > > I have to admit to a small element of cargo-cult programming in
> > > __perf_counter_enable and __perf_counter_disable: I put calls to
> > > curr_rq_lock_irq_save/restore in there because I was following the
> > > model of __perf_install_in_context, but I don't know what they do
> > > (beyond disabling/restoring interrupts) or why they are needed.
> > > What are they there for?
> >
> > That's an ugly detail: we can only observe and update the current task
> > clock with the runqueue lock held. So right now the counter ops are
> > structured so that we first take the rq lock (which implies irq
> > disable as well), and then all sw counters can assume that the rq lock
> > is held.
>
> OK. Is there any advantage of the task_clock over the cpu_clock? Could
> we get rid of task_clock (and therefore remove the curr_rq_lock_* calls)
> and just use the cpu_clock?

the cpu_clock() uses the rq lock just as much - it's the same basic
scheduler time they are using.

We could drop the rq clock, but i found that the timec output gets a bit
ugly that way, if used with multiple sw counters.

Instead of getting neat output like:

$ timec -e -2,-2,-2,-2 ls

[...]

Performance counter stats for 'ls':

2.830582 task clock ticks (msecs)

2.830582 task clock ticks (msecs)
2.830582 task clock ticks (msecs)
2.830582 task clock ticks (msecs)
2.830582 task clock ticks (msecs)

Wall-clock time elapsed: 46.353323 msecs

We get something like (mockup, from memory):

Performance counter stats for 'ls':

2.830431 task clock ticks (msecs)

2.830582 task clock ticks (msecs)
2.830613 task clock ticks (msecs)
2.830781 task clock ticks (msecs)
2.831100 task clock ticks (msecs)

Wall-clock time elapsed: 46.353323 msecs

As each individual counter is sampled differently.

But i'd like to drop the rq lock in any case - it's not really needed from
a scheduling POV. Maybe we could take the current sched time once, save it
and reuse that in the sw counters (instead of each sw counter taking that
time individuall)?

> Also, since we know we're only taking differences between readings on
> the same cpu, could we use sched_clock() instead of cpu_clock() to
> reduce overhead? (We certainly could on powerpc, since we have
> synchronized timebase registers, but I know you don't have that luxury
> on x86. :)

there's an evil plan behind my use of cpu_clock() / sum_exec_runtime: that
way if the scheduler folks break those metrics the performance analysis
folks immediately notify us of our stupidity ;-) It's also the metrics
that 'top' and 'time' use - so it's a bit more natural to use - these
things should work together and break together. The scheduler's deadline
logic also uses this metric so there's synergy.

I think if we do the time-sampling approach i suggested above the locking
and overhead problem disappears and we can just use cpu_clock() or
sum_exec_runtime once, and use it from the sw counter(s)?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/