Re: [Regression] drm/scheduler: track GPU active time per entity

From: Luben Tuikov
Date: Wed Apr 05 2023 - 12:10:00 EST


On 2023-04-05 10:05, Danilo Krummrich wrote:
> On 4/4/23 06:31, Luben Tuikov wrote:
>> On 2023-03-28 04:54, Lucas Stach wrote:
>>> Hi Danilo,
>>>
>>> On Tuesday, 28.03.2023 at 02:57 +0200, Danilo Krummrich wrote:
>>>> Hi all,
>>>>
>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>> tries to track the accumulated time that a job was active on the GPU,
>>>> writing it to the entity through which the job was originally deployed
>>>> to the scheduler. This is done within drm_sched_get_cleanup_job(),
>>>> which fetches a job from the scheduler's pending_list.
>>>>
>>>> Doing this can result in a race condition where the entity is already
>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>> once the job is fetched from the pending_list.
>>>>
>>>> Once drm_sched_entity_destroy() has been called, it should be safe to
>>>> free the structure that embeds the entity. However, a job originally
>>>> handed over to the scheduler by this entity might still reside in the
>>>> scheduler's pending_list for cleanup after drm_sched_entity_destroy()
>>>> has been called and the entity has been freed. Hence, we can run into
>>>> a UAF.
>>>>
>>> Sorry about that, I clearly didn't properly consider this case.
>>>
>>>> In my case, a job, as explained above, was picked from the scheduler's
>>>> pending_list just after the entity was freed due to the client
>>>> application exiting. Meanwhile, the freed memory had already been
>>>> reallocated for a subsequent client application's job structure.
>>>> Hence, the new job's memory got corrupted. Luckily, I was able to
>>>> reproduce the same corruption over and over again by just using
>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>
>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>> I'm missing something), which is why I'm writing this mail instead of
>>>> sending a fix directly.
>>>>
>>>> Spontaneously, I see three options to fix it:
>>>>
>>>> 1. Rather than embedding the entity into driver-specific structures
>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>> reference count it, such that it's only freed once all jobs that were
>>>> deployed through this entity are fetched from the scheduler's pending list.
>>>>
>>> My vote is on this or something in a similar vein for the long term. I
>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>> more fairness than the current one sometime in the future, which
>>> requires execution time tracking on the entities.
>>
>> Danilo,
>>
>> Using kref is preferable, i.e. option 1 above.
>
> I think the only real motivation for doing that would be for generically
> tracking job statistics within the entity a job was deployed through. If
> we all agree on tracking job statistics this way I am happy to prepare a
> patch for this option and drop this one:
> https://lore.kernel.org/all/20230331000622.4156-1-dakr@xxxxxxxxxx/T/#u

Hmm, I never thought about "job statistics" when I preferred using kref above.
kref is attractive because one doesn't need to worry about it: when
the last user drops the kref, the release callback is called to do
housekeeping. If this never happens, we know that we have a bug to debug.
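
To make the lifetime rule concrete, here is a minimal userspace sketch,
not kernel code and not a proposed patch. It assumes hypothetical names
(demo_entity, demo_job and their helpers) and uses a plain counter where
the kernel would use struct kref with kref_get()/kref_put(). Each job takes
a reference on its entity, so the entity outlives the owner's final put
until the last pending job has been cleaned up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a scheduler entity (not struct drm_sched_entity). */
struct demo_entity {
	unsigned int refcount;	/* plain counter; the kernel would use struct kref */
	uint64_t elapsed_ns;	/* accumulated GPU time of jobs from this entity */
	bool *released;		/* illustration hook: set when the release runs */
};

/* Hypothetical stand-in for a scheduler job. */
struct demo_job {
	struct demo_entity *entity;
	uint64_t runtime_ns;
};

static struct demo_entity *demo_entity_create(bool *released)
{
	struct demo_entity *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->refcount = 1;	/* reference held by the owner (e.g. file_priv) */
	e->released = released;
	return e;
}

static void demo_entity_get(struct demo_entity *e)
{
	assert(e->refcount > 0);
	e->refcount++;
}

/* Drop one reference; the last put does the housekeeping and frees. */
static void demo_entity_put(struct demo_entity *e)
{
	assert(e->refcount > 0);
	if (--e->refcount == 0) {
		if (e->released)
			*e->released = true;
		free(e);
	}
}

/* Submitting a job pins the entity for as long as the job is pending. */
static void demo_job_init(struct demo_job *job, struct demo_entity *e,
			  uint64_t runtime_ns)
{
	demo_entity_get(e);
	job->entity = e;
	job->runtime_ns = runtime_ns;
}

/* Cleanup may touch the entity safely because the job still holds a ref. */
static void demo_job_cleanup(struct demo_job *job)
{
	job->entity->elapsed_ns += job->runtime_ns;
	demo_entity_put(job->entity);
	job->entity = NULL;	/* make any later use fail loudly */
}
```

With this shape, the owner calling demo_entity_put() while a job is still
pending only drops the count; the release runs from demo_job_cleanup().
A real implementation would need an atomic count and proper locking.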

Regarding the patch above: I did look around the code, and it seems safe,
as per your analysis. I didn't see any reference to the entity after job
submission, but I'll comment on that thread as well for the record.

Regards,
Luben

>
> Christian mentioned amdgpu tried something similar to what Lucas tried,
> ran into similar trouble, backed off and implemented it in another
> way - a driver-specific way, I guess?
>
>>
>> Lucas, can you shed some light on the following:
>>
>> 1. In what way is the current FIFO scheduling unfair, and
>> 2. can you give some details on this "scheduling algorithm with a bit
>> more fairness than the current one"?
>>
>> Regards,
>> Luben
>>
>>>
>>>> 2. Somehow make sure drm_sched_entity_destroy() blocks until all
>>>> jobs deployed through this entity have been fetched from the scheduler's
>>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>>
>>>> 3. Just revert the change and let drivers implement tracking of GPU
>>>> active times themselves.
>>>>
>>> Given that we are already pretty late in the release cycle and etnaviv
>>> is the only driver so far making use of the scheduler elapsed time
>>> tracking, I think the right short-term solution is to either move the
>>> tracking into etnaviv or just revert the change for now. I'll have a
>>> look at this.
>>>
>>> Regards,
>>> Lucas
>>>
>>>> In the case of just reverting the change, I'd propose to also set a job's
>>>> entity pointer to NULL once the job has been taken from the entity, such
>>>> that in case of a future issue we fail where the actual issue resides
>>>> and to make it more obvious that the field shouldn't be used anymore
>>>> after the job has been taken from the entity.
>>>>
>>>> I'm happy to implement the solution we agree on. However, it might also
>>>> make sense to revert the change until we have a solution in place. I'm
>>>> also happy to send a revert with a proper description of the problem.
>>>> Please let me know what you think.
>>>>
>>>> - Danilo
>>>>
>>>
>>
>