Re: [RFC] perf_events: support for uncore a.k.a. nest units

From: stephane eranian
Date: Wed Apr 21 2010 - 04:45:08 EST


On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <ming.m.lin@xxxxxxxxx> wrote:
> On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote:
>> Seems to me that struct pmu is a shared resource across all CPUs.
>> I don't understand why scheduling on one CPU would have to impact
>> all the other CPUs, unless I am missing something here.
>
> Do you mean the pmu->flag?

Yes.
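
To make the concern concrete, here is a minimal userspace sketch (not
kernel code) of the per-CPU layout I have in mind. The names NR_CPUS,
pmu_sketch, cpu_state_sketch, txn_started() and the explicit cpu argument
are assumptions made up for illustration; in the kernel the per-CPU slot
would presumably live next to cpu_hw_events and be reached through the
usual per-cpu accessors.

```c
#include <assert.h>

#define NR_CPUS 4                       /* hypothetical CPU count for this sketch */
#define PERF_EVENT_TRAN_STARTED 1

/* Shared by all CPUs, like struct pmu: it carries no mutable state here. */
struct pmu_sketch {
        const char *name;
};

/* One slot per CPU, standing in for a per-cpu variable. */
struct cpu_state_sketch {
        unsigned int flag;
};

static struct cpu_state_sketch cpu_state[NR_CPUS];

/* Mark a transaction as started on one CPU only. */
static void start_txn(struct pmu_sketch *pmu, int cpu)
{
        (void)pmu;                      /* the shared object is not modified */
        assert(cpu >= 0 && cpu < NR_CPUS);
        cpu_state[cpu].flag |= PERF_EVENT_TRAN_STARTED;
}

/* Clear the transaction flag on one CPU only. */
static void stop_txn(struct pmu_sketch *pmu, int cpu)
{
        (void)pmu;
        assert(cpu >= 0 && cpu < NR_CPUS);
        cpu_state[cpu].flag &= ~(unsigned int)PERF_EVENT_TRAN_STARTED;
}

/* Non-zero if a transaction is in progress on the given CPU. */
static int txn_started(int cpu)
{
        assert(cpu >= 0 && cpu < NR_CPUS);
        return (cpu_state[cpu].flag & PERF_EVENT_TRAN_STARTED) != 0;
}
```

With the flag stored per CPU, starting a transaction on CPU 1 leaves
CPU 0 untouched, which is not the case when the flag sits in the shared
struct pmu.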

> You are right, pmu->flag should be per-CPU data.
>
> Will update the patch.
>
> Thanks,
> Lin Ming
>
>>
>>
>> On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@xxxxxxxxx> wrote:
>> > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote:
>> >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote:
>> >>
>> >> > > One thing not on that list, which should happen first I guess, is to
>> >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of
>> >> > > transactional API to the struct pmu, so that we can delay the
>> >> > > schedulability check until commit time (and roll back when it fails).
>> >> > >
>> >> > > Something as simple as:
>> >> > >
>> >> > >   struct pmu {
>> >> > >     void start_txn(struct pmu *);
>> >> > >     void commit_txn(struct pmu *);
>> >> > >
>> >> > >     ...
>> >> > >   };
>> >> >
>> >> > Could you please explain a bit more?
>> >> >
>> >> > Does it mean that "start_txn" performs the event scheduling
>> >> > and "commit_txn" performs the event assignment?
>> >> >
>> >> > Does "commit time" mean the actual activation in hw_perf_enable?
>> >>
>> >> No, the idea behind hw_perf_group_sched_in() is to not perform
>> >> schedulability tests on each event in the group, but to add the group as
>> >> a whole and then perform one test.
>> >>
>> >> Of course, when that test fails, you'll have to roll-back the whole
>> >> group again.
>> >>
>> >> So start_txn (or a better name) would simply toggle a flag in the pmu
>> >> implementation that will make pmu::enable() not perform the
>> >> schedulability test.
>> >>
>> >> Then commit_txn() will perform the schedulability test (so note the
>> >> method has to have a !void return value, my mistake in the earlier
>> >> email).
>> >>
>> >> This will allow us to use the regular
>> >> kernel/perf_event.c::group_sched_in() and all the rollback code.
>> >> Currently each hw_perf_group_sched_in() implementation duplicates all
>> >> the rollback code (with various bugs).
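
The flow Peter describes above can be sketched in userspace roughly as
follows. This is only an illustration under stated assumptions, not the
kernel implementation: toy_pmu, NUM_COUNTERS, MAX_EVENTS and all toy_*
helpers are made-up names, and toy_schedulable() stands in for
schedule_events() by simply checking that the events fit in the counters.

```c
#include <assert.h>

#define NUM_COUNTERS 2                  /* hypothetical number of hardware counters */
#define MAX_EVENTS   8                  /* hypothetical size of the event list */
#define TXN_STARTED  1

struct toy_pmu {
        int n_events;                   /* events collected so far */
        unsigned int flag;              /* TXN_STARTED while a group is being added */
};

/* Stand-in for schedule_events(): an assignment exists iff n fits. */
static int toy_schedulable(int n)
{
        return n <= NUM_COUNTERS ? 0 : -1;
}

static void toy_start_txn(struct toy_pmu *pmu) { pmu->flag |= TXN_STARTED; }
static void toy_stop_txn(struct toy_pmu *pmu)  { pmu->flag &= ~(unsigned int)TXN_STARTED; }

/* enable(): collect the event; test right away only outside a transaction. */
static int toy_enable(struct toy_pmu *pmu)
{
        int n = pmu->n_events + 1;

        if (n > MAX_EVENTS)
                return -1;
        if (!(pmu->flag & TXN_STARTED) && toy_schedulable(n))
                return -1;
        pmu->n_events = n;
        return 0;
}

static void toy_disable(struct toy_pmu *pmu)
{
        assert(pmu->n_events > 0);
        pmu->n_events--;
}

/* commit: one schedulability test for everything collected so far. */
static int toy_commit_txn(struct toy_pmu *pmu)
{
        return toy_schedulable(pmu->n_events);
}

/* Add `count` events as one unit; on failure remove all of them again. */
static int toy_group_sched_in(struct toy_pmu *pmu, int count)
{
        int added = 0, ret = 0;

        toy_start_txn(pmu);
        while (added < count) {
                ret = toy_enable(pmu);
                if (ret)
                        break;
                added++;
        }
        if (!ret)
                ret = toy_commit_txn(pmu);
        toy_stop_txn(pmu);

        if (ret)
                while (added-- > 0)     /* roll back only what this call added */
                        toy_disable(pmu);
        return ret;
}
```

In this sketch a group of two events fits, while a further group that
would exceed NUM_COUNTERS is rejected at commit time and rolled back,
leaving the previously scheduled events in place.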
>> >>
>> >>
>> >>
>> >> We must get rid of all weak hw_perf_*() functions before we can properly
>> >> consider multiple struct pmu implementations.
>> >>
>> >
>> > Thanks for the clear explanation.
>> >
>> > Does the patch below show what you mean?
>> >
>> > I have only touched the x86 arch code for now.
>> >
>> > ---
>> >  arch/x86/kernel/cpu/perf_event.c |  161 +++++++++++--------------------------
>> >  include/linux/perf_event.h       |   10 ++-
>> >  kernel/perf_event.c              |   28 +++----
>> >  3 files changed, 67 insertions(+), 132 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> > index 626154a..62aa9a1 100644
>> > --- a/arch/x86/kernel/cpu/perf_event.c
>> > +++ b/arch/x86/kernel/cpu/perf_event.c
>> > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event)
>> >         if (n < 0)
>> >                 return n;
>> >
>> > +       if (event->pmu->flag & PERF_EVENT_TRAN_STARTED)
>> > +               goto out;
>> > +
>> >         ret = x86_pmu.schedule_events(cpuc, n, assign);
>> >         if (ret)
>> >                 return ret;
>> > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event)
>> >          */
>> >         memcpy(cpuc->assign, assign, n*sizeof(int));
>> >
>> > +out:
>> >         cpuc->n_events = n;
>> >         cpuc->n_added += n - n0;
>> >
>> > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
>> >         return &unconstrained;
>> >  }
>> >
>> > -static int x86_event_sched_in(struct perf_event *event,
>> > -                              struct perf_cpu_context *cpuctx)
>> > -{
>> > -       int ret = 0;
>> > -
>> > -       event->state = PERF_EVENT_STATE_ACTIVE;
>> > -       event->oncpu = smp_processor_id();
>> > -       event->tstamp_running += event->ctx->time - event->tstamp_stopped;
>> > -
>> > -       if (!is_x86_event(event))
>> > -               ret = event->pmu->enable(event);
>> > -
>> > -       if (!ret && !is_software_event(event))
>> > -               cpuctx->active_oncpu++;
>> > -
>> > -       if (!ret && event->attr.exclusive)
>> > -               cpuctx->exclusive = 1;
>> > -
>> > -       return ret;
>> > -}
>> > -
>> > -static void x86_event_sched_out(struct perf_event *event,
>> > -                                struct perf_cpu_context *cpuctx)
>> > -{
>> > -       event->state = PERF_EVENT_STATE_INACTIVE;
>> > -       event->oncpu = -1;
>> > -
>> > -       if (!is_x86_event(event))
>> > -               event->pmu->disable(event);
>> > -
>> > -       event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
>> > -
>> > -       if (!is_software_event(event))
>> > -               cpuctx->active_oncpu--;
>> > -
>> > -       if (event->attr.exclusive || !cpuctx->active_oncpu)
>> > -               cpuctx->exclusive = 0;
>> > -}
>> > -
>> > -/*
>> > - * Called to enable a whole group of events.
>> > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
>> > - * Assumes the caller has disabled interrupts and has
>> > - * frozen the PMU with hw_perf_save_disable.
>> > - *
>> > - * called with PMU disabled. If successful and return value 1,
>> > - * then guaranteed to call perf_enable() and hw_perf_enable()
>> > - */
>> > -int hw_perf_group_sched_in(struct perf_event *leader,
>> > -               struct perf_cpu_context *cpuctx,
>> > -               struct perf_event_context *ctx)
>> > -{
>> > -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>> > -       struct perf_event *sub;
>> > -       int assign[X86_PMC_IDX_MAX];
>> > -       int n0, n1, ret;
>> > -
>> > -       if (!x86_pmu_initialized())
>> > -               return 0;
>> > -
>> > -       /* n0 = total number of events */
>> > -       n0 = collect_events(cpuc, leader, true);
>> > -       if (n0 < 0)
>> > -               return n0;
>> > -
>> > -       ret = x86_pmu.schedule_events(cpuc, n0, assign);
>> > -       if (ret)
>> > -               return ret;
>> > -
>> > -       ret = x86_event_sched_in(leader, cpuctx);
>> > -       if (ret)
>> > -               return ret;
>> > -
>> > -       n1 = 1;
>> > -       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>> > -               if (sub->state > PERF_EVENT_STATE_OFF) {
>> > -                       ret = x86_event_sched_in(sub, cpuctx);
>> > -                       if (ret)
>> > -                               goto undo;
>> > -                       ++n1;
>> > -               }
>> > -       }
>> > -       /*
>> > -        * copy new assignment, now we know it is possible
>> > -        * will be used by hw_perf_enable()
>> > -        */
>> > -       memcpy(cpuc->assign, assign, n0*sizeof(int));
>> > -
>> > -       cpuc->n_events  = n0;
>> > -       cpuc->n_added  += n1;
>> > -       ctx->nr_active += n1;
>> > -
>> > -       /*
>> > -        * 1 means successful and events are active
>> > -        * This is not quite true because we defer
>> > -        * actual activation until hw_perf_enable() but
>> > -        * this way we* ensure caller won't try to enable
>> > -        * individual events
>> > -        */
>> > -       return 1;
>> > -undo:
>> > -       x86_event_sched_out(leader, cpuctx);
>> > -       n0  = 1;
>> > -       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>> > -               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
>> > -                       x86_event_sched_out(sub, cpuctx);
>> > -                       if (++n0 == n1)
>> > -                               break;
>> > -               }
>> > -       }
>> > -       return ret;
>> > -}
>> > -
>> >  #include "perf_event_amd.c"
>> >  #include "perf_event_p6.c"
>> >  #include "perf_event_p4.c"
>> > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event)
>> >         x86_perf_event_update(event);
>> >  }
>> >
>> > +/*
>> > + * Set the flag to make pmu::enable() not perform the
>> > + * schedulability test.
>> > + */
>> > +static void x86_pmu_start_txn(struct pmu *pmu)
>> > +{
>> > +       pmu->flag |= PERF_EVENT_TRAN_STARTED;
>> > +}
>> > +
>> > +static void x86_pmu_stop_txn(struct pmu *pmu)
>> > +{
>> > +       pmu->flag &= ~PERF_EVENT_TRAN_STARTED;
>> > +}
>> > +
>> > +/*
>> > + * Return 0 if the transaction commits successfully
>> > + */
>> > +static int x86_pmu_commit_txn(struct pmu *pmu)
>> > +{
>> > +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>> > +       int assign[X86_PMC_IDX_MAX];
>> > +       int n, ret;
>> > +
>> > +       n = cpuc->n_events;
>> > +
>> > +       if (!x86_pmu_initialized())
>> > +               return -EAGAIN;
>> > +
>> > +       ret = x86_pmu.schedule_events(cpuc, n, assign);
>> > +       if (ret)
>> > +               return ret;
>> > +
>> > +       /*
>> > +        * copy new assignment, now we know it is possible
>> > +        * will be used by hw_perf_enable()
>> > +        */
>> > +       memcpy(cpuc->assign, assign, n*sizeof(int));
>> > +
>> > +       return 0;
>> > +}
>> > +
>> >  static const struct pmu pmu = {
>> >         .enable         = x86_pmu_enable,
>> >         .disable        = x86_pmu_disable,
>> > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = {
>> >         .stop           = x86_pmu_stop,
>> >         .read           = x86_pmu_read,
>> >         .unthrottle     = x86_pmu_unthrottle,
>> > +       .start_txn      = x86_pmu_start_txn,
>> > +       .stop_txn       = x86_pmu_stop_txn,
>> > +       .commit_txn     = x86_pmu_commit_txn,
>> >  };
>> >
>> >  /*
>> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> > index bf896d0..93aa8d8 100644
>> > --- a/include/linux/perf_event.h
>> > +++ b/include/linux/perf_event.h
>> > @@ -524,6 +524,8 @@ struct hw_perf_event {
>> >
>> >  struct perf_event;
>> >
>> > +#define PERF_EVENT_TRAN_STARTED 1
>> > +
>> >  /**
>> >   * struct pmu - generic performance monitoring unit
>> >   */
>> > @@ -534,6 +536,11 @@ struct pmu {
>> >         void (*stop)                    (struct perf_event *event);
>> >         void (*read)                    (struct perf_event *event);
>> >         void (*unthrottle)              (struct perf_event *event);
>> > +       void (*start_txn)               (struct pmu *pmu);
>> > +       void (*stop_txn)                (struct pmu *pmu);
>> > +       int (*commit_txn)               (struct pmu *pmu);
>> > +
>> > +       u8 flag;
>> >  };
>> >
>> >  /**
>> > @@ -799,9 +806,6 @@ extern void perf_disable(void);
>> >  extern void perf_enable(void);
>> >  extern int perf_event_task_disable(void);
>> >  extern int perf_event_task_enable(void);
>> > -extern int hw_perf_group_sched_in(struct perf_event *group_leader,
>> > -               struct perf_cpu_context *cpuctx,
>> > -               struct perf_event_context *ctx);
>> >  extern void perf_event_update_userpage(struct perf_event *event);
>> >  extern int perf_event_release_kernel(struct perf_event *event);
>> >  extern struct perf_event *
>> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
>> > index 07b7a43..4537676 100644
>> > --- a/kernel/perf_event.c
>> > +++ b/kernel/perf_event.c
>> > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
>> >  void __weak hw_perf_disable(void)               { barrier(); }
>> >  void __weak hw_perf_enable(void)                { barrier(); }
>> >
>> > -int __weak
>> > -hw_perf_group_sched_in(struct perf_event *group_leader,
>> > -               struct perf_cpu_context *cpuctx,
>> > -               struct perf_event_context *ctx)
>> > -{
>> > -       return 0;
>> > -}
>> > -
>> >  void __weak perf_event_print_debug(void)        { }
>> >
>> >  static DEFINE_PER_CPU(int, perf_disable_count);
>> > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event,
>> >                struct perf_event_context *ctx)
>> >  {
>> >         struct perf_event *event, *partial_group;
>> > +       struct pmu *pmu = (struct pmu *)group_event->pmu;
>> >         int ret;
>> >
>> >         if (group_event->state == PERF_EVENT_STATE_OFF)
>> >                 return 0;
>> >
>> > -       ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
>> > -       if (ret)
>> > -               return ret < 0 ? ret : 0;
>> > +       pmu->start_txn(pmu);
>> >
>> >         if (event_sched_in(group_event, cpuctx, ctx))
>> >                 return -EAGAIN;
>> > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event,
>> >                 }
>> >         }
>> >
>> > -       return 0;
>> > +       ret = pmu->commit_txn(pmu);
>> > +       if (!ret) {
>> > +               pmu->stop_txn(pmu);
>> > +               return 0;
>> > +       }
>> >
>> >  group_error:
>> > +       pmu->stop_txn(pmu);
>> > +
>> >         /*
>> > -        * Groups can be scheduled in as one unit only, so undo any
>> > -        * partial group before returning:
>> > +        * Commit transaction fails, rollback
>> > +        * Groups can be scheduled in as one unit only, so undo
>> > +        * whole group before returning:
>> >          */
>> >         list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>> > -               if (event == partial_group)
>> > -                       break;
>> >                 event_sched_out(event, cpuctx, ctx);
>> >         }
>> >         event_sched_out(group_event, cpuctx, ctx);
>> >
>> >
>> >
>
>