Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

From: stephane eranian
Date: Mon Jan 25 2010 - 12:48:25 EST

On Mon, Jan 25, 2010 at 6:25 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, 2010-01-25 at 18:12 +0100, stephane eranian wrote:
>> On Fri, Jan 22, 2010 at 9:27 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
>> >> @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
>> >>          * apply assignment obtained either from
>> >>          * hw_perf_group_sched_in() or x86_pmu_enable()
>> >>          *
>> >> -        * step1: save events moving to new counters
>> >> -        * step2: reprogram moved events into new counters
>> >> +        * We either re-enable or re-program and re-enable.
>> >> +        * All events are disabled by the time we come here.
>> >> +        * That means their state has been saved already.
>> >>          */
>> >
>> > I'm not seeing how it is true.
>> > Suppose a core2 with counter0 active counting a non-restricted event,
>> > say cpu_cycles. Then we do:
>> >
>> > perf_disable()
>> >  hw_perf_disable()
>> >    intel_pmu_disable_all
>> >
>> everything is disabled globally, yet individual counter0 is not.
>> But that's enough to stop it.
>> > ->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
>> >  x86_pmu_enable()
>> >    collect_events()
>> >    x86_schedule_events()
>> >    n_added = 1
>> >
>> >    /* also slightly confused about this */
>> >    if (hwc->idx != -1)
>> >      x86_perf_event_set_period()
>> >
>> In x86_pmu_enable(), we have not yet actually assigned the
>> counter to hwc->idx. This is only accomplished by hw_perf_enable().
>> Yet, x86_perf_event_set_period() is going to write the MSR.
>> My understanding is that you never call enable(event) in code
>> outside of a perf_disable()/perf_enable() section.
> That should be so yes, last time I verified that it was. Hence I'm a bit
> puzzled by that set_period(), hw_perf_enable() will assign ->idx and do
> set_period() so why also do it here...

Ok, so I think we can drop set_period() from enable(event).

>> > perf_enable()
>> >  hw_perf_enable()
>> >
>> >    /* and here we'll assign the new event to counter0
>> >     * except we never disabled it... */
>> >
>> You will have two events, scheduled, cycles in counter1
>> and mem_load_retired in counter0. Neither hwc->idx
>> will match previous state and thus both will be rewritten.
> And by programming mem_load_retired you just wiped the counter value of
> the cycle counter, there should be an x86_perf_event_update() in between
> stopping the counter and moving that configuration.
>> I think the case you are worried about is different. It is the
>> case where you would move an event to a new counter
>> without replacing it with a new event. Given that the individual
>> MSR.en would still be 1 AND that enable_all() enables all
>> counters (even the ones not actively used), then we would
>> get a runaway counter so to speak.
>> It seems a solution would be to call x86_pmu_disable() before
>> assigning an event to a new counter for all events which are
>> moving. This is because we cannot assume all events have been
>> previously disabled individually. Something like
>> if (!match_prev_assignment(hwc, cpuc, i)) {
>>    if (hwc->idx != -1)
>>       x86_pmu.disable(hwc, hwc->idx);
>>    x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
>>    x86_perf_event_set_period(event, hwc, hwc->idx);
>> }
> Yes and no, my worry is not that it's not counting, but that we didn't
> store the actual counter value before over-writing it with the new
> configuration.
> As to your suggestion,
>  1) we would have to do x86_pmu_disable() since that does
> x86_perf_event_update().
>  2) I worried about the case where we basically switch two counters,
> there we cannot do the x86_perf_event_update() in a single pass since
> programming the first counter will destroy the value of the second.
> Now possibly the scenario in 2 isn't possible because the event
> scheduling is stable enough for this never to happen, but I wasn't
> feeling too sure about that, so I skipped this part for now.
I think what adds to the complexity here is that there are two distinct
disable() mechanisms: perf_disable() and x86_pmu.disable(). They
don't operate the same way. You would think that by calling hw_perf_disable()
you would stop individual events as well (thus saving their values). That is
not the case: if you do perf_disable() and then read the count, you will not
get the up-to-date value in event->count. You need pmu->disable(event)
to ensure that.

So my understanding is that perf_disable() is meant for a temporary stop,
thus no need to save the count.
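
To illustrate the difference between the two stops, here is a minimal
user-space sketch. It is a toy model, not kernel code: every toy_* name is
hypothetical, and it assumes a counter only advances when both the global
enable and its own enable bit are set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; all names below are hypothetical, not kernel APIs. */
struct toy_counter {
	uint64_t hw_value;	/* value held in the hardware counter */
	bool     enabled;	/* per-counter enable bit */
};

struct toy_event {
	uint64_t count;		/* software-visible count, like event->count */
	uint64_t prev;		/* hardware value at the last save */
};

struct toy_pmu {
	bool               global_enable;	/* global enable bit */
	struct toy_counter ctr;
};

/* Hardware side: the counter advances only if both enables are set. */
static void toy_tick(struct toy_pmu *pmu, uint64_t n)
{
	if (pmu->global_enable && pmu->ctr.enabled)
		pmu->ctr.hw_value += n;
}

/* Global stop, in the spirit of perf_disable(): nothing is saved. */
static void toy_global_disable(struct toy_pmu *pmu)
{
	pmu->global_enable = false;
}

/* Per-event stop: clear the enable bit and fold the delta into the count. */
static void toy_event_disable(struct toy_pmu *pmu, struct toy_event *ev)
{
	pmu->ctr.enabled = false;
	ev->count += pmu->ctr.hw_value - ev->prev;
	ev->prev = pmu->ctr.hw_value;
}
```

In this model, after toy_global_disable() the counter no longer advances,
but ev->count stays stale until toy_event_disable() is called.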

As for 2), I believe this can happen if you add 2 new events which have more
restrictions. For instance, on Core, suppose you are measuring cycles and inst
in generic counters, and then add 2 events which can only be measured on
generic counters. That will cause cycles and inst to be moved to fixed counters.

So we have to modify hw_perf_enable() to first disable all events which are
moving, then reprogram them. I suspect it may be possible to optimize this if
we detect that those events had already been stopped individually (as opposed
to perf_disable()), i.e., already had their counts saved.
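
The two-pass idea, and the counter-swap problem from Peter's point 2), can be
sketched in user space as follows. This is only an illustration under
simplifying assumptions: counters are plain integers, programming a counter
resets it to zero, and all toy_* names are hypothetical rather than existing
kernel functions.

```c
#include <stdint.h>

/* Toy model only; all names below are hypothetical, not kernel APIs. */
struct toy_event {
	int      idx;		/* counter currently holding the event, -1 if none */
	int      new_idx;	/* counter chosen by the scheduler */
	uint64_t count;		/* accumulated software count */
};

/* Fold the live hardware value into the event and release the counter. */
static void toy_stop_and_save(uint64_t *hw, struct toy_event *ev)
{
	ev->count += hw[ev->idx];
	hw[ev->idx] = 0;
	ev->idx = -1;
}

/* Program the event into its new counter, starting from zero. */
static void toy_program(uint64_t *hw, struct toy_event *ev)
{
	hw[ev->new_idx] = 0;
	ev->idx = ev->new_idx;
}

/* Single pass: save and reprogram each moving event in turn. */
static void toy_reassign_single_pass(uint64_t *hw, struct toy_event *evs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (evs[i].idx == evs[i].new_idx)
			continue;	/* not moving, leave it alone */
		if (evs[i].idx != -1)
			toy_stop_and_save(hw, &evs[i]);
		toy_program(hw, &evs[i]);
	}
}

/* Two passes: stop and save every moving event, then reprogram. */
static void toy_reassign_two_pass(uint64_t *hw, struct toy_event *evs, int n)
{
	int i;

	/* pass 1: save the count of every event that is moving */
	for (i = 0; i < n; i++)
		if (evs[i].idx != -1 && evs[i].idx != evs[i].new_idx)
			toy_stop_and_save(hw, &evs[i]);

	/* pass 2: program moved and newly added events */
	for (i = 0; i < n; i++)
		if (evs[i].idx != evs[i].new_idx)
			toy_program(hw, &evs[i]);
}
```

With two events swapping counters that hold 10 and 20, the two-pass version
leaves the events with counts 10 and 20. The single-pass version programs the
first event over the second event's counter before that value was saved, so
the second event ends up with 0.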
