Re: [RFC v3 00/12] DRM scheduling cgroup controller

From: Michal Koutný
Date: Thu Jan 26 2023 - 08:01:43 EST


On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> I don't immediately see how you envisage the half-userspace implementation
> would look like in terms of what functionality/new APIs would be provided by
> the kernel?

Output:
drm.stat (with consumed time(s))

Input:
drm.throttle (alternatives)
- a) writing 0,1 (in rough analogy to your proposed
notifications)
- b) writing duration (in loose analogy to memory.reclaim)
- for how long GPU work should be backed off

An userspace agent sitting between these two and it'd do the measurement
and calculation depending on given policies (weighting, throttling) and
apply respective controls.

(In resemblance of e.g. https://denji.github.io/cpulimit/)

> Problem there is to find a suitable point to charge at. If for a moment we
> limit the discussion to i915, out of the box we could having charging
> happening at several thousand times per second to effectively never. This is
> to illustrate the GPU context execution dynamics which range from many small
> packets of work to multi-minute, or longer. For the latter to be accounted
> for we'd still need some periodic scanning, which would then perhaps go per
> driver. For the former we'd have thousands of needless updates per second.
>
> Hence my thinking was to pay both the cost of accounting and collecting the
> usage data once per actionable event, where the latter is controlled by some
> reasonable scanning period/frequency.
>
> In addition to that, a few DRM drivers already support GPU usage querying
> via fdinfo, so that being externally triggered, it is next to trivial to
> wire all those DRM drivers into such common DRM cgroup controller framework.
> All that every driver needs to implement on top is the "over budget"
> callback.

I'd also like show comparison with CPU accounting and controller.
There is tick-based (~sampling) measurement of various components of CPU
time (task_group_account_field()). But the actual schedulling (weights)
or throttling is based on precise accounting (update_curr()).

So, if the goal is to have precise and guaranteed limits, it shouldn't
(cannot) be based on sampling. OTOH, if it must be sampling based due to
variability of the device landscape, it could be advisory mechanism with
the userspace component.

My 0.02€,
Michal

Attachment: signature.asc
Description: Digital signature