Re: [Regression] drm/scheduler: track GPU active time per entity

From: Lucas Stach
Date: Tue Mar 28 2023 - 04:55:00 EST


Hi Danilo,

Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
> Hi all,
>
> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> tries to track the accumulated time that a job was active on the GPU
> writing it to the entity through which the job was deployed to the
> scheduler originally. This is done within drm_sched_get_cleanup_job()
> which fetches a job from the schedulers pending_list.
>
> Doing this can result in a race condition where the entity is already
> freed, but the entity's newly added elapsed_ns field is still accessed
> once the job is fetched from the pending_list.
>
> After drm_sched_entity_destroy() being called it should be safe to free
> the structure that embeds the entity. However, a job originally handed
> over to the scheduler by this entity might still reside in the
> schedulers pending_list for cleanup after drm_sched_entity_destroy()
> already being called and the entity being freed. Hence, we can run into
> a UAF.
>
Sorry about that, I clearly didn't properly consider this case.

> In my case it happened that a job, as explained above, was just picked
> from the schedulers pending_list after the entity was freed due to the
> client application exiting. Meanwhile this freed up memory was already
> allocated for a subsequent client applications job structure again.
> Hence, the new jobs memory got corrupted. Luckily, I was able to
> reproduce the same corruption over and over again by just using
> deqp-runner to run a specific set of VK test cases in parallel.
>
> Fixing this issue doesn't seem to be very straightforward though (unless
> I miss something), which is why I'm writing this mail instead of sending
> a fix directly.
>
> Spontaneously, I see three options to fix it:
>
> 1. Rather than embedding the entity into driver specific structures
> (e.g. tied to file_priv) we could allocate the entity separately and
> reference count it, such that it's only freed up once all jobs that were
> deployed through this entity are fetched from the schedulers pending list.
>
My vote is on this or something in similar vain for the long term. I
have some hope to be able to add a GPU scheduling algorithm with a bit
more fairness than the current one sometime in the future, which
requires execution time tracking on the entities.

> 2. Somehow make sure drm_sched_entity_destroy() does block until all
> jobs deployed through this entity were fetched from the schedulers
> pending list. Though, I'm pretty sure that this is not really desirable.
>
> 3. Just revert the change and let drivers implement tracking of GPU
> active times themselves.
>
Given that we are already pretty late in the release cycle and etnaviv
being the only driver so far making use of the scheduler elapsed time
tracking I think the right short term solution is to either move the
tracking into etnaviv or just revert the change for now. I'll have a
look at this.

Regards,
Lucas

> In the case of just reverting the change I'd propose to also set a jobs
> entity pointer to NULL once the job was taken from the entity, such
> that in case of a future issue we fail where the actual issue resides
> and to make it more obvious that the field shouldn't be used anymore
> after the job was taken from the entity.
>
> I'm happy to implement the solution we agree on. However, it might also
> make sense to revert the change until we have a solution in place. I'm
> also happy to send a revert with a proper description of the problem.
> Please let me know what you think.
>
> - Danilo
>