Re: [PATCH 16/17] cgroup/drm: Expose memory stats

From: Tvrtko Ursulin
Date: Wed Jul 26 2023 - 12:44:59 EST



On 21/07/2023 23:21, Tejun Heo wrote:
On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
$ cat drm.memory.stat
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
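
For orchestration tooling that wants to consume this file, a minimal user-space sketch of a parser follows. It assumes the layout is exactly as in the sample above: one line per card and region, whitespace-separated, with `key=value` fields and no spaces inside region names. The function name `parse_drm_memory_stat` is made up for illustration and is not part of the series.

```python
def parse_drm_memory_stat(text):
    """Parse drm.memory.stat text into {(card, region): {category: bytes}}.

    Assumes the line layout shown in the sample output; this is an
    illustration, not a specification of the file format.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        card, pairs = fields[0], fields[1:]
        # Each remaining field is expected to be "key=value".
        kv = dict(pair.split("=", 1) for pair in pairs)
        region = kv.pop("region")
        stats[(card, region)] = {name: int(value) for name, value in kv.items()}
    return stats


sample = """\
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
"""

stats = parse_drm_memory_stat(sample)
print(stats[("card0", "system")]["resident"])  # prints 12111872
```

Keying on the (card, region) pair keeps multiple memory regions per driver separate, which matches how the file reports them.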

Data is generated on demand for simplicity of implementation, i.e. no running
totals are kept or accounted during migrations and such. Various
optimisations such as cheaper collection of data are possible but
deliberately left out for now.

Overall, the feature is deemed to be useful to container orchestration
software (and manual management).

Limits, either soft or hard, are not envisaged to be implemented on top of
this approach due to the on-demand nature of collecting the stats.

So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.

It would be quite easy to make the implicit migration fail - it is just a matter of failing the first ioctl after the file descriptor is accessed by a new owner, since that ioctl is what triggers the migration.
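
To make the idea concrete, here is a toy user-space model, not kernel code. Everything in it is hypothetical: the `DrmFile` class, the `allow_implicit_migration` switch and the choice of `EACCES` are assumptions for illustration only. It just shows the two behaviours: an ioctl from a new owner either migrates ownership implicitly, or is failed so that ownership stays put.

```python
import errno


class DrmFile:
    """Toy stand-in for an open DRM file descriptor (hypothetical)."""

    def __init__(self, owner_cgroup, allow_implicit_migration=True):
        self.owner_cgroup = owner_cgroup
        self.allow_implicit_migration = allow_implicit_migration

    def ioctl(self, caller_cgroup):
        """Return 0 on success or a negative errno, kernel style."""
        if caller_cgroup != self.owner_cgroup:
            if not self.allow_implicit_migration:
                # Errno picked for illustration only.
                return -errno.EACCES
            # Implicit migration: the caller becomes the new owner.
            self.owner_cgroup = caller_cgroup
        return 0


migrating = DrmFile("xorg")
strict = DrmFile("xorg", allow_implicit_migration=False)

print(migrating.ioctl("client"), migrating.owner_cgroup)
print(strict.ioctl("client"), strict.owner_cgroup)
```

In the strict variant the owner never changes, so any accounting tied to the owner stays with the original group.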

But I don't think I can really add that in the RFC given I have no hard controls or anything like that.

With GPU usage throttling it doesn't really apply, at least I don't think it does, since even when migrated to a lower budget group it would just get immediately de-prioritized.

I don't think hard GPU time limits are feasible in general, and while soft might be, again I don't see that any limiting would necessarily have to run immediately on implicit migration.

The second part of the story is the hypothetical/future memory controls.

I think the first thing to say is that implicit migration is important, but it is not really an established pattern to use the file descriptor from two places or to migrate it more than once. It is simply a fresh fd which gets sent to clients from Xorg, which is one of the legacy ways of doing things.

So we probably can just ignore that given no significant amount of memory ownership would be getting migrated.

And for drm.memory.stat I think what I have is good enough - both private and shared data get accounted, for any clients that have handles to particular buffers.

Maarten was working on memory controls so maybe he would have more thoughts on memory ownership and implicit migration.

But I don't think there is anything incompatible between that and drm.memory.stat as proposed here, given that the categories reported are the established ones from the DRM fdinfo spec, and it is a fact that we can have multiple memory regions per driver.

The main thing that would change between this RFC and future memory controls in the area of drm.memory.stat is the implementation - it would have to get changed under the hood from "collect on query" to "account at allocation/free/etc". But that is just an implementation detail.
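
As a rough illustration of the difference between the two approaches, here is a toy user-space model, not the kernel implementation. The class names `Buffer`, `CollectOnQuery` and `AccountAtAllocation` are invented for this sketch. Both report the same totals; they differ only in when the work is done.

```python
from collections import defaultdict


class Buffer:
    """A buffer object with a size in bytes and a memory region name."""

    def __init__(self, size, region):
        self.size = size
        self.region = region


class CollectOnQuery:
    """Keeps only the buffer list; totals are computed when asked for."""

    def __init__(self):
        self.buffers = []

    def alloc(self, buf):
        self.buffers.append(buf)

    def free(self, buf):
        self.buffers.remove(buf)

    def total(self, region):
        # Walk every buffer on each query.
        return sum(b.size for b in self.buffers if b.region == region)


class AccountAtAllocation:
    """Maintains running totals, updated on every alloc and free."""

    def __init__(self):
        self.totals = defaultdict(int)

    def alloc(self, buf):
        self.totals[buf.region] += buf.size

    def free(self, buf):
        self.totals[buf.region] -= buf.size

    def total(self, region):
        # Constant-time lookup; no walk needed.
        return self.totals[region]


a, b = Buffer(4096, "system"), Buffer(8192, "system")
results = {}
for model in (CollectOnQuery(), AccountAtAllocation()):
    model.alloc(a)
    model.alloc(b)
    model.free(a)
    results[type(model).__name__] = model.total("system")

print(results)
```

With running totals the current usage is known at the moment of allocation, which is what any limit check would need; with collect on query it is only known after a full walk.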

Regards,

Tvrtko