Re: [RFC 00/17] DRM scheduling cgroup controller

From: Tvrtko Ursulin
Date: Thu Oct 27 2022 - 10:36:47 EST



Hi Tejun,

On 19/10/2022 19:45, Tejun Heo wrote:
Hello,

On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
...
DRM static priority interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.priority_levels
One of:
1) And integer representing the minimum number of discrete priority
levels for the whole group.
Optionally followed by an asterisk ('*') indicating some DRM clients
in the group support more than the minimum number.
2) '0'- indicating one or more DRM clients in the group has no support
for static priority control.
3) 'n/a' - when there are no DRM clients in the configured group.

drm.priority
A read-write integer between -10000 and 10000 (inclusive) representing
an abstract static priority level.

drm.effective_priority
Read only integer showing the current effective priority level for the
group. Effective meaning taking into account the chain of inherited

From interface POV, this is a lot worse than the second proposal and I'd
really like to avoid this. Even if we go with mapping user priority
configuration to per-driver priorities, I'd much prefer if the interface
presented to user is weight based and let each driver try to match the
resulting hierarchical weight (ie. the absolute proportion a given cgroup
should have at the point in time) as best as they can rather than exposing
opaque priority numbers to userspace whose meaning isn't defined at all.

I actually somewhat agree here and this proposal was mainly motivated by my desire to come up with *something* which would allow similar _external_ control as it exists in the CPU and IO world (nice/ionice).

Because currently priority of GPU workloads cannot be controlled from the outside at all. And especially if we consider pipelines composed of stages where part of the processing is done on the CPU and part on the GPU (or AI/VPU accelerator), then I think it would be really useful to be able to do so.

To this effect I proposed connecting CPU nice with GPU priority, same as it is connected for IO priority (so only if it hasn't been explicitly changed away from the defaults), but at that point I was getting feedback nudging me into direction of cgroups.

Looking at what's available in cgroups world now, I have spotted the blkio.prio.class control. For my current use case (lower GPU scheduling of background/unfocused windows) that would also work. Even if starting with just two possible values - 'no-change' and 'idle' (to follow the IO controller naming).

How would you view that as a proposal? It would be a bit tougher "sell" to the DRM community, perhaps, given that many drivers do have scheduling priority, but the concept of scheduling class is not really there. Nevertheless a concept of normal-vs-background could be plausible in my mind. It could be easily implemented by using the priority controls available in many drivers.

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.

Please take a look at io.weight. This can follow the same convention to
express both global and per-device weights.

drm.period_us
An integer representing the period with which the controller should look
at the GPU usage by the group and potentially send the over/under budget
signal.
Value of zero (defaul) disables the soft limit checking.

Can we not do period_us or at least make it a per-driver tuning parameter
exposed as module param? Weight, users can easily understand and configure.
period_us is a lot more an implementation detail. If we want to express the
trade-off between latency and bandwidth at the interface, we prolly should
encode the latency requirement in a more canonical way but let's leave that
for the future.

Yes agreed - for the moment (while RFC) it is handy for me to have it controllable per group. But I agree a kernel global (modparam) should be sufficient in the final solution.


drm.budget_supported
One of:
1) 'yes' - when all DRM clients in the group support the functionality.
2) 'no' - when at least one of the DRM clients does not support the
functionality.
3) 'n/a' - when there are no DRM clients in the group.

Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
with and I'm not sure 'no' meaning at least one device not supporting is
intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
drop this.

The idea actually is for this to be per-cgroup and potentially change dynamically. It implements the concept of "observability", how I have, perhaps clumsily, named it.

This is because we can have a mix of DRM file descriptors in a cgroup, not all of which support the proposed functionality. So I wanted to have something by which the administrator can observe the status of the group.

For instance seeing some clients do not support the feature could be signal that things have been misconfigured, or that appeal needs to be made for driver X to start supporting the feature. Seeing a "no" there in other words is a signal that budgeting may not really work as expected and needs to be investigated.

Another basic interface question. Is everyone happy with the drm prefix or
should it be something like gpu? Also, in the future, if there's a consensus
around how to control gpu memory, what prefix would that take?

Given the current effort to bring in AI/VPU/ML accelerator devices under the DRM umbrella I think drm prefix is okay. At least some of those devices (the ones I looked at) will accept the same concepts of scheduling and memory allocation as well.

And for memory specifically, I don't see why the drm prefix would not work.

But yeah, I echo the calls for wider feedback. Given how drm cgroup controller has been on a wish list for ages I have to say I expected a bit more interest. :)

The second proposal is a little bit more advanced in concept and also a little
bit less finished. Interesting thing is that it builds upon the per client GPU
utilisation work which landed recently for a few drivers. So my thinking is that
in principle, an intersect of drivers which support both that and some sort of
priority scheduling control, could also in theory support this.

Another really interesting angle for this controller is that it mimics the same
control menthod used by the CPU scheduler. That is the proportional/weight based
GPU time budgeting. Which makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogenous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes likes this:

* DRM cgroup controller implements the control files and the scanning loop.
* DRM core is required to track all DRM clients belonging to processes so it
can answer when asked how much GPU time is a process using.
* DRM core also provides a call back which the controller will call when a
certain process is over budget.
* Individual drivers need to implement two similar hooks, but which work for
a single DRM client. Over budget callback and GPU utilisation query.

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost equally effective as the static priority
control. I say almost because the design where budget control depends on the
periodic usage scanning has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

The unfinished part is the GPU budgeting split which currently does not
propagate unused bandwith to children, neither can share it with siblings. But
this is not due fundamental reasons, just to avoid spending too much time on it
too early.

Rather than doing it hierarchically on the spot, it's usually a lot cheaper
and easier to calculate the flattened hierarchical weight per leaf cgroup
and divide the bandwidth according to the eventual portions. For an example,
please take a look at block/blk-iocost.c.

This seems exactly what I had in mind (but haven't implemented it yet). So in this RFC I have budget splitting per group where each tree level adds up to "100%" and the thing which I have not implemented is "borrowing" or yielding (how blk-iocost.c calls it, or donating) unused budgets to siblings.

I am very happy to hear my idea is the right one and someone already implemented it. Thanks for this pointer!

I don't know much about the drm driver side, so can't comment much on it but
I do really like the idea of having the core implementation determining who
should get how much and then letting each driver enforce the target. That
seems a lot more robust and generic than trying to somehow coax and expose
per-driver priority implementations directly.

Thanks, and thanks for having a detailed read and providing great feedback so far!

Regards,

Tvrtko