Re: [RFC 0/6] Non perf based Gen Graphics OA unit driver

From: Peter Zijlstra
Date: Fri Oct 16 2015 - 05:43:25 EST


On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
>
> To review this work I think it will be relevant to have a good
> general familiarity with Gen graphics (e.g. thinking about the OA
> unit's interaction with the command streamer and execlist
> scheduling) as well as our userspace architecture and how we're
> consuming OA data within Mesa to implement the
> INTEL_performance_query extension.
>
> On the flip side here, it's necessary to understand the perf
> userspace interface (for most this is hidden by tools so the details
> aren't common knowledge) as well as the internal design, considering
> that the PMU we're looking at seems to break several current design
> assumptions. I can only claim a limited familiarity with perf's
> design, just as a result of this work.

Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)

> - The current OA PMU driver breaks some significant design assumptions.
>
> Existing perf pmus are used for profiling work on a cpu and we're
> introducing the idea of _IS_DEVICE pmus with different security
> implications, the need to fake cpu-related data (such as user/kernel
> registers) to fit with perf's current design, and adding _DEVICE
> records as a way to forward device-specific status records.

There are more devices with counters on than GPUs, so I think it might
make sense to look at extending perf to better deal with this.

> The OA unit writes reports of counters into a circular buffer,
> without involvement from the CPU, making our PMU driver the first
> of its kind.

Agreed, this is somewhat 'odd' from where we are today.

> Perf supports groups of counters and allows those to be read via
> transactions internally but transactions currently seem designed to
> be explicitly initiated from the cpu (say in response to a userspace
> read()) and while we could pull a report out of the OA buffer we
> can't trigger a report from the cpu on demand.
>
> Related to being report based; the OA counters are configured in HW
> as a set while perf generally expects counter configurations to be
> orthogonal. Although counters can be associated with a group leader
> as they are opened, there's no clear precedent for being able to
> provide group-wide configuration attributes and no obvious solution
> as yet that's expected to be acceptable to upstream and meets our
> userspace needs.

I'm not entirely sure what you mean with group-wide configuration
attributes; could you elaborate?

> We currently avoid using perf's grouping feature
> and forward OA reports to userspace via perf's 'raw' sample field.
> This suits our userspace well considering how coupled the counters
> are when dealing with normalizing. It would be inconvenient to split
> counters up into separate events, only to require userspace to
> recombine them.

So IF you were using a group, a single read from the leader can return
you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
do that recombine.

Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
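For reference, with PERF_FORMAT_GROUP (optionally plus PERF_FORMAT_ID) a
single read() on the leader returns every member's count in one buffer, in
the field order documented in perf_event_open(2). A minimal userspace sketch
decoding that layout from mock data (no syscalls; the time_enabled /
time_running fields are omitted for brevity):

```c
#include <stddef.h>
#include <stdint.h>

/* Layout of a read() result when PERF_FORMAT_GROUP | PERF_FORMAT_ID are
 * set, per perf_event_open(2):
 *     u64 nr;                                  // events in the group
 *     struct { u64 value; u64 id; } values[nr];
 * (time_enabled/time_running would precede values[] if requested.)
 */
struct group_value {
	uint64_t value;
	uint64_t id;
};

/* Decode 'buf' into 'out' (capacity 'max'); returns the event count. */
static size_t decode_group_read(const uint64_t *buf,
				struct group_value *out, size_t max)
{
	uint64_t nr = buf[0];
	size_t i;

	for (i = 0; i < nr && i < max; i++) {
		out[i].value = buf[1 + 2 * i];	/* counter value */
		out[i].id    = buf[2 + 2 * i];	/* kernel-assigned id */
	}
	return (size_t)nr;
}
```

So userspace gets the whole coupled counter set in one vector rather than
recombining per-event samples.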

> Related to counter orthogonality; we can't time share the OA unit,
> while event scheduling is a central design idea within perf for
> allowing userspace to open + enable more events than can be
> configured in HW at any one time.

So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.

That said; earlier today I saw:

https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7

where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).

So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.

> The OA unit is not designed to
> allow re-configuration while in use. We can't reconfigure the OA
> unit without losing internal OA unit state which we can't access
> explicitly to save and restore. Reconfiguring the OA unit is also
> relatively slow, involving ~100 register writes. From userspace Mesa
> also depends on a stable OA configuration when emitting
> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
> disabled while there are outstanding MI_RPC commands lest we hang
> the command streamer.

Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.
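To illustrate the semantics PERF_PMU_CAP_EXCLUSIVE gives a PMU, here is a
userspace mock of the behaviour (not kernel code; the struct and function
names are made up for the sketch): while one event configuration is live,
further event creation is refused with -EBUSY instead of being time-shared.

```c
#include <errno.h>
#include <stdbool.h>

/* Mock of an exclusive PMU's event_init policy. Illustration only:
 * the real capability lives in struct pmu::capabilities and is
 * enforced by the perf core, not by driver code like this. */
struct mock_pmu {
	bool exclusive;		/* acts as if PERF_PMU_CAP_EXCLUSIVE set */
	int active_events;
};

static int mock_event_init(struct mock_pmu *pmu)
{
	/* Only one configuration may be live; no rescheduling/sharing. */
	if (pmu->exclusive && pmu->active_events > 0)
		return -EBUSY;
	pmu->active_events++;
	return 0;
}
```

This matches the OA constraint above: the unit cannot be reconfigured or
disabled underneath an active user.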

> - We may be making some technical compromises a.t.m for the sake of
> using perf.
>
> perf_event_open() requires events to either relate to a pid or a
> specific cpu core, while our device pmu relates to neither. Events
> opened with a pid will be automatically enabled/disabled according
> to the scheduling of that process - so not appropriate for us.

Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it; although I've no clue what form
that would need at this time.

> When
> an event is related to a cpu id, perf ensures pmu methods will be
> invoked via an inter-processor interrupt on that core. To avoid
> invasive changes our userspace opens OA perf events for a specific
> cpu.

Some of that might still make sense in the sense that GPUs are subject
to the NUMA topology of machines. I would think you would want most
such things to be done on the node the device is attached to.

Granted, this might not be a concern for Intel graphics, but it might be
relevant for some of the discrete GPUs.
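For a discrete GPU, the node a PCI device is attached to is visible in
sysfs, so userspace could in principle pick the right CPU to open events on.
A small sketch (the example path is illustrative; numa_node reads -1 when
the platform reports no affinity):

```c
#include <stdio.h>

/* Read a device's NUMA node from sysfs, e.g.
 * /sys/bus/pci/devices/0000:03:00.0/numa_node (path is illustrative).
 * Returns the node id, or -1 on error or when no affinity is reported. */
static int read_numa_node(const char *path)
{
	FILE *f = fopen(path, "r");
	int node = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &node) != 1)
		node = -1;
	fclose(f);
	return node;
}
```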

> - I'm not confident our use case benefits much from building on perf:
>
> We aren't using existing perf based tooling with our PMU. Existing
> tools typically assume you're profiling work running on a cpu, e.g.
> expecting samples to be associated with instruction pointers and
> user/kernel registers and aiming to represent metrics in relation
> to application source code. We're forwarding fake register values
> and userspace needs to know how to decode the raw OA reports
> before anything can be reported to a user.
>
> With the buffering done by the OA unit I don't think we currently
> benefit from perf's mmapped circular buffer interface. We already
> have a decoupled producer and consumer and since we have to copy out
> of the OA buffer, it would work well for us to hide that copy in
> a simpler read() based interface.
>
>
> - Logistically it might be more practical to contain this to the
> graphics stack.
>
> It seems fair to consider that if we can't see a very compelling
> benefit to building on perf, then containing this work to
> drivers/gpu/drm/i915 may simplify the review process as well as
> future maintenance and development.

> Peter; I wonder if you would tend to agree too that it could make sense
> for us to go with our own interface here?

Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.

While I can certainly see the pain points and why you would rather not
deal with them, I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).