Re: [Regression] drm/scheduler: track GPU active time per entity

From: Danilo Krummrich
Date: Wed Apr 05 2023 - 11:38:25 EST

Next message: Jacob Pan: "[PATCH v3] iommu/vt-d: Fix PASID directory pointer coherency"
Previous message: Jordan Crouse: "Re: [PATCH] drm/msm: Check for the GPU IOMMU during bind"
In reply to: Luben Tuikov: "Re: [Regression] drm/scheduler: track GPU active time per entity"
Next in thread: Luben Tuikov: "Re: [Regression] drm/scheduler: track GPU active time per entity"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/4/23 06:31, Luben Tuikov wrote:

On 2023-03-28 04:54, Lucas Stach wrote:

Hi Danilo,

Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:

Hi all,

Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
tries to track the accumulated time that a job was active on the GPU
writing it to the entity through which the job was deployed to the
scheduler originally. This is done within drm_sched_get_cleanup_job()
which fetches a job from the schedulers pending_list.

Doing this can result in a race condition where the entity is already
freed, but the entity's newly added elapsed_ns field is still accessed
once the job is fetched from the pending_list.

After drm_sched_entity_destroy() being called it should be safe to free
the structure that embeds the entity. However, a job originally handed
over to the scheduler by this entity might still reside in the
schedulers pending_list for cleanup after drm_sched_entity_destroy()
already being called and the entity being freed. Hence, we can run into
a UAF.

Sorry about that, I clearly didn't properly consider this case.

In my case it happened that a job, as explained above, was just picked
from the schedulers pending_list after the entity was freed due to the
client application exiting. Meanwhile this freed up memory was already
allocated for a subsequent client applications job structure again.
Hence, the new jobs memory got corrupted. Luckily, I was able to
reproduce the same corruption over and over again by just using
deqp-runner to run a specific set of VK test cases in parallel.

Fixing this issue doesn't seem to be very straightforward though (unless
I miss something), which is why I'm writing this mail instead of sending
a fix directly.

Spontaneously, I see three options to fix it:

1. Rather than embedding the entity into driver specific structures
(e.g. tied to file_priv) we could allocate the entity separately and
reference count it, such that it's only freed up once all jobs that were
deployed through this entity are fetched from the schedulers pending list.

My vote is on this or something in similar vain for the long term. I
have some hope to be able to add a GPU scheduling algorithm with a bit
more fairness than the current one sometime in the future, which
requires execution time tracking on the entities.

Danilo,

Using kref is preferable, i.e. option 1 above.

I think the only real motivation for doing that would be for generically tracking job statistics within the entity a job was deployed through. If we all agree on tracking job statistics this way I am happy to prepare a patch for this option and drop this one: https://lore.kernel.org/all/20230331000622.4156-1-dakr@xxxxxxxxxx/T/#u

Christian mentioned amdgpu tried something similar to what Lucas tried running into similar trouble, backed off and implemented it in another way - a driver specific way I guess?

Lucas, can you shed some light on,

1. In what way the current FIFO scheduling is unfair, and
2. shed some details on this "scheduling algorithm with a bit
more fairness than the current one"?

Regards,
Luben

2. Somehow make sure drm_sched_entity_destroy() does block until all
jobs deployed through this entity were fetched from the schedulers
pending list. Though, I'm pretty sure that this is not really desirable.

3. Just revert the change and let drivers implement tracking of GPU
active times themselves.

Given that we are already pretty late in the release cycle and etnaviv
being the only driver so far making use of the scheduler elapsed time
tracking I think the right short term solution is to either move the
tracking into etnaviv or just revert the change for now. I'll have a
look at this.

Regards,
Lucas

In the case of just reverting the change I'd propose to also set a jobs
entity pointer to NULL once the job was taken from the entity, such
that in case of a future issue we fail where the actual issue resides
and to make it more obvious that the field shouldn't be used anymore
after the job was taken from the entity.

I'm happy to implement the solution we agree on. However, it might also
make sense to revert the change until we have a solution in place. I'm
also happy to send a revert with a proper description of the problem.
Please let me know what you think.

- Danilo

Next message: Jacob Pan: "[PATCH v3] iommu/vt-d: Fix PASID directory pointer coherency"
Previous message: Jordan Crouse: "Re: [PATCH] drm/msm: Check for the GPU IOMMU during bind"
In reply to: Luben Tuikov: "Re: [Regression] drm/scheduler: track GPU active time per entity"
Next in thread: Luben Tuikov: "Re: [Regression] drm/scheduler: track GPU active time per entity"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]