Re: [tip:perf/urgent] perf/x86: Enable raw event access to Intel offcore events

From: Stephane Eranian
Date: Mon Nov 21 2011 - 16:39:15 EST


On Mon, Nov 21, 2011 at 10:04 PM, Vince Weaver <vweaver1@xxxxxxxxxxxx> wrote:
> On Mon, 21 Nov 2011, Stephane Eranian wrote:
>
>> > We have a workaround, but it currently disables kernel multiplexing and a
>> > few other nice features.
>>
>> I don't understand why you have a problem with NMI watchdog. Multiplexing
>> allows you to still measure more events than there are counters.
>
> The PAPI code currently uses FORMAT_GROUP and puts as many events as
> possible in a group. The way we maximize events in a group is to
> add events until perf_events indicates a failure.
>
Ok, now I understand your problem.

When you submit events (one by one), the kernel runs a scheduling simulation,
but it only takes into account the events in the group. The goal is to verify
that the group is schedulable. It does not look at events from system-wide
sessions (which have priority). Doing so would be difficult for per-thread
events because it would have to look at all CPUs. And even if it did, that
would still be no guarantee of schedulability: by the time the group is
actually scheduled, a new system-wide session may exist, i.e., your group
does not get scheduled because there aren't enough counters left over after
the system-wide events.
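
To make that concrete, here is a minimal, self-contained sketch (not PAPI
code; the event choice and group size are arbitrary) of the "add events
until open fails" pattern. Because the open-time simulation only considers
the group itself, perf_event_open() can accept a group that a counter
already claimed elsewhere (e.g. by the NMI watchdog) will later prevent
from being scheduled:

/*
 * Sketch only: fill a per-thread group with as many events as
 * perf_event_open() will accept.  The open-time simulation looks at
 * the group alone, so counters taken by system-wide users are not
 * accounted for here.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    int fds[16], n = 0, leader = -1;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.read_format = PERF_FORMAT_GROUP;
    attr.disabled = 1;                  /* leader is started explicitly later */

    while (n < 16) {
        int fd = perf_event_open(&attr, 0, -1, leader, 0);
        if (fd < 0)
            break;                      /* kernel considers the group full */
        if (leader < 0) {
            leader = fd;
            attr.disabled = 0;          /* siblings follow the leader */
            attr.read_format = 0;       /* only the leader needs FORMAT_GROUP */
        }
        fds[n++] = fd;
    }

    printf("group accepted %d events\n", n);
    while (n > 0)
        close(fds[--n]);
    return 0;
}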

The problem you're describing is not specific to x86 and the NMI watchdog.
It applies to all architectures and has to do with system-wide pinned events
vs. per-thread group schedulability. The NMI watchdog on x86 is an example
of that. But AFAIK, the watchdog could be supported on other architectures
as well.

One could argue that you could check whether the NMI watchdog is active and
assume you have one fewer counter during the scheduling simulation. But I
think it is a bit more complicated than that.

To solve this problem in the general case, you need to know which counters
are taken (or required) by ALL pinned system-wide events across all CPUs.
Once you've constructed that bitmap, you can use it as the basis (used_mask)
when trying to schedule the group events.
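
Purely to illustrate the idea (this is not kernel code; the data structures
and names below are invented for the sketch), the best-effort check would OR
together the counters claimed by pinned system-wide events on every CPU and
start the group simulation from that mask:

/*
 * Rough illustration only.  pinned_mask[] stands in for "counters
 * claimed by pinned system-wide events" (e.g. 0x1 everywhere when the
 * NMI watchdog owns counter 0); real constraints are per-event, here
 * we assume any event fits any counter.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CPUS      4
#define NUM_COUNTERS  4

static unsigned int pinned_mask[NUM_CPUS] = { 0x1, 0x1, 0x1, 0x1 };

static bool group_schedulable(int nr_group_events)
{
    unsigned int used_mask = 0;
    int cpu, i, c;

    /* step 1: counters taken by pinned system-wide events on ANY cpu */
    for (cpu = 0; cpu < NUM_CPUS; cpu++)
        used_mask |= pinned_mask[cpu];

    /* step 2: try to place each group event in a remaining counter */
    for (i = 0; i < nr_group_events; i++) {
        for (c = 0; c < NUM_COUNTERS; c++) {
            if (!(used_mask & (1u << c))) {
                used_mask |= 1u << c;
                break;
            }
        }
        if (c == NUM_COUNTERS)
            return false;               /* no counter left for this event */
    }
    return true;
}

int main(void)
{
    printf("4-event group schedulable? %s\n",
           group_schedulable(4) ? "yes" : "no");
    printf("3-event group schedulable? %s\n",
           group_schedulable(3) ? "yes" : "no");
    return 0;
}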

That's a best-effort algorithm. It does not guarantee that the group will be
schedulable during its entire lifetime. It simply makes it more likely that
it will run, assuming no changes to system-wide pinned events between the
moment you create the per-thread group and the time it starts counting.


> When NMI watchdog is enabled, a counter is stolen. Yet the perf_events
> code does not account for this.
>
> So say on an AMD machine with 4 counters (3 after one is stolen)
> perf_events lets you add 4 events to an event group, even though only 3
> are available. It does not report failure upon open or start, only at
> read. By then it's too late.
>
> We have to work around this, by doing an extra read at open time to verify
> that the event group actually is valid, adding overhead.
>
> Our multiplex code tries to maximize the number of events in a group too.
> Currently PAPI works around this by just not doing kernel multiplexing
> if a NMI watchdog is detected. There's probably more elegant solutions
> such as checking with a read there too, or not using FORMAT_GROUP at all,
> but since this is an ABI regression I was hoping it would get fixed
> quickly enough that I wouldn't have to construct better workarounds.
>
> Vince
> vweaver1@xxxxxxxxxxxx
>
>
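
For completeness, the read-after-open workaround Vince describes above could
look roughly like the sketch below (not PAPI's actual code, and the exact
failure signature -- a short read, a read error, or time_running staying at
0 -- may vary with kernel version). It assumes the group leader was opened
with PERF_FORMAT_GROUP plus the two time fields, that siblings follow the
leader, and that the group has at most 16 events:

/*
 * Sketch of a sanity check after perf_event_open(): briefly enable the
 * group, then read it back; if the group never went on the PMU, it did
 * not really fit.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

static int group_really_fits(int leader, int nr_events)
{
    /* layout: nr, time_enabled, time_running, value[nr] */
    uint64_t buf[3 + 16];
    ssize_t sz = (3 + nr_events) * sizeof(uint64_t);

    /* run the group for an instant, then stop it again */
    ioctl(leader, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(leader, PERF_EVENT_IOC_DISABLE, 0);

    if (read(leader, buf, sz) != sz)
        return 0;       /* read failed or came back short */
    if (buf[2] == 0)
        return 0;       /* time_running == 0: group never scheduled */
    return 1;
}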