Re: [Regression] drm/scheduler: track GPU active time per entity

From: Daniel Vetter
Date: Thu Apr 06 2023 - 06:10:02 EST


On Thu, Apr 06, 2023 at 06:05:11PM +0900, Asahi Lina wrote:
> On 06/04/2023 17.27, Daniel Vetter wrote:
> > On Thu, 6 Apr 2023 at 10:22, Christian König <christian.koenig@xxxxxxx> wrote:
> > >
> > > On 05.04.23 at 18:09, Luben Tuikov wrote:
> > > > On 2023-04-05 10:05, Danilo Krummrich wrote:
> > > > > On 4/4/23 06:31, Luben Tuikov wrote:
> > > > > > On 2023-03-28 04:54, Lucas Stach wrote:
> > > > > > > Hi Danilo,
> > > > > > >
> > > > > > > On Tuesday, 28.03.2023 at 02:57 +0200, Danilo Krummrich wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
> > > > > > > > tries to track the accumulated time that a job was active on the GPU,
> > > > > > > > writing it to the entity through which the job was deployed to the
> > > > > > > > scheduler originally. This is done within drm_sched_get_cleanup_job(),
> > > > > > > > which fetches a job from the scheduler's pending_list.
> > > > > > > >
> > > > > > > > Doing this can result in a race condition where the entity is already
> > > > > > > > freed, but the entity's newly added elapsed_ns field is still accessed
> > > > > > > > once the job is fetched from the pending_list.
> > > > > > > >
> > > > > > > > After drm_sched_entity_destroy() has been called, it should be safe to
> > > > > > > > free the structure that embeds the entity. However, a job originally
> > > > > > > > handed over to the scheduler by this entity might still reside in the
> > > > > > > > scheduler's pending_list for cleanup after drm_sched_entity_destroy()
> > > > > > > > has already been called and the entity has been freed. Hence, we can
> > > > > > > > run into a UAF.
> > > > > > > >
> > > > > > > Sorry about that, I clearly didn't properly consider this case.
> > > > > > >
> > > > > > > > In my case it happened that a job, as explained above, was just picked
> > > > > > > > from the scheduler's pending_list after the entity was freed due to the
> > > > > > > > client application exiting. Meanwhile, this freed-up memory had already
> > > > > > > > been allocated again for a subsequent client application's job structure.
> > > > > > > > Hence, the new job's memory got corrupted. Luckily, I was able to
> > > > > > > > reproduce the same corruption over and over again by just using
> > > > > > > > deqp-runner to run a specific set of VK test cases in parallel.
> > > > > > > >
> > > > > > > > Fixing this issue doesn't seem to be very straightforward though (unless
> > > > > > > > I miss something), which is why I'm writing this mail instead of sending
> > > > > > > > a fix directly.
> > > > > > > >
> > > > > > > > Spontaneously, I see three options to fix it:
> > > > > > > >
> > > > > > > > 1. Rather than embedding the entity into driver specific structures
> > > > > > > > (e.g. tied to file_priv) we could allocate the entity separately and
> > > > > > > > reference count it, such that it's only freed up once all jobs that were
> > > > > > > > deployed through this entity are fetched from the scheduler's pending list.
> > > > > > > >
> > > > > > > My vote is on this or something in a similar vein for the long term. I
> > > > > > > have some hope to be able to add a GPU scheduling algorithm with a bit
> > > > > > > more fairness than the current one sometime in the future, which
> > > > > > > requires execution time tracking on the entities.
> > > > > > Danilo,
> > > > > >
> > > > > > Using kref is preferable, i.e. option 1 above.
> > > > > I think the only real motivation for doing that would be for generically
> > > > > tracking job statistics within the entity a job was deployed through. If
> > > > > we all agree on tracking job statistics this way I am happy to prepare a
> > > > > patch for this option and drop this one:
> > > > > https://lore.kernel.org/all/20230331000622.4156-1-dakr@xxxxxxxxxx/T/#u
> > > > Hmm, I never thought about "job statistics" when I preferred using kref above.
> > > > The reason kref is attractive is because one doesn't need to worry about
> > > > it--when the last user drops the kref, the release is called to do
> > > > housekeeping. If this never happens, we know that we have a bug to debug.
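
For anyone following along, the pattern being described is roughly the
one below. This is only a user-space sketch under my own assumptions:
struct fake_entity and the fake_entity_*() helpers are made-up names, and
in the kernel you would embed a struct kref and use kref_get()/kref_put()
with a release callback instead of open-coding the atomics.

```c
/* User-space sketch only; the kernel equivalent is struct kref. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a separately allocated scheduler entity. */
struct fake_entity {
	atomic_int refcount;
	unsigned long long elapsed_ns;
	int *released;	/* set to 1 by the release path, for demonstration */
};

static struct fake_entity *fake_entity_create(int *released)
{
	struct fake_entity *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	atomic_init(&e->refcount, 1);	/* reference held by the creator */
	e->released = released;
	return e;
}

/* Each job would take a reference when it is submitted. */
static void fake_entity_get(struct fake_entity *e)
{
	atomic_fetch_add(&e->refcount, 1);
}

/* Whoever drops the last reference runs the release housekeeping. */
static void fake_entity_put(struct fake_entity *e)
{
	if (atomic_fetch_sub(&e->refcount, 1) == 1) {
		*e->released = 1;
		free(e);
	}
}

/* Returns 1 if the entity stayed valid until the job was cleaned up. */
static int fake_entity_demo(void)
{
	int released = 0;
	struct fake_entity *e = fake_entity_create(&released);

	if (!e)
		return 0;
	fake_entity_get(e);		/* job submitted through the entity */
	fake_entity_put(e);		/* owner drops its handle ("destroy") */
	if (released)
		return 0;		/* this would be the UAF window */
	e->elapsed_ns += 1000;		/* cleanup path can still write safely */
	fake_entity_put(e);		/* job cleanup drops the last reference */
	return released;
}
```

Note that this only shows the mechanics of get/put. It says nothing about
what else the entity would have to keep alive, which is the part discussed
further down.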
> > >
> > > Yeah, reference counting unfortunately has some traps as well. For
> > > example, rarely dropping the last reference from interrupt context, or
> > > with some unexpected locks held when the cleanup function doesn't expect
> > > that, is a good recipe for problems as well.
> > >
> > > > Regarding the patch above--I did look around the code, and it seems safe,
> > > > as per your analysis, I didn't see any reference to entity after job submission,
> > > > but I'll comment on that thread as well for the record.
> > >
> > > Reference counting the entities was suggested before. We intentionally
> > > avoided that so far because the entity might be the tip of the iceberg
> > > of stuff you need to keep around.
> > >
> > > For example for command submission you also need the VM and when you
> > > keep the VM alive you also need to keep the file private alive....
> >
> > Yeah, refcounting often looks like the easy way out to avoid
> > use-after-free issues, until you realize you've just made lifetimes
> > unbounded and have some enormous leaks: entity keeps vm alive, vm
> > keeps all the BOs alive, somehow every crash wastes more memory
> > because vk_device_lost means userspace allocates new stuff for
> > everything.
>
> Refcounting everywhere has been working well for us, so well that so far all
> the oopses we've hit have been... drm_sched bugs like this one, not anything
> in the driver. But at least in Rust you have the advantage that you can't
> just forget a decref in a rarely-hit error path (or worse, forget an incref
> somewhere important)... ^^
>
> > If possible a lifetime design where lifetimes have hard bounds and you
> > just borrow a reference under a lock (or some other ownership rule) is
> > generally much more solid. But also much harder to design correctly
> > :-/
> >
> > > In addition to that, we have some ugly interdependencies between tearing
> > > down an application (potentially with a KILL signal from the OOM killer)
> > > and backward compatibility for some applications which render something
> > > and quit before the rendering is completed in the hardware.
> >
> > Yeah I think that part would also be good to sort out once&for all in
> > drm/sched, because i915 has/had the same struggle.
> > -Daniel
> >
>
> Is this really a thing? I think that's never going to work well for explicit
> sync, since the kernel doesn't even know what BOs it has to keep alive for a
> job... I guess it could keep the entire file and all of its objects/VMs/etc
> alive until all of its submissions complete but... ewww.
>
> Our Mesa implementation synchronously waits for all jobs on context destroy
> for this reason, but if you just kill the app, yeah, you get faults as
> running GPU jobs have BOs yanked out from under them. Kill loops make for a
> good way of testing fault handling...

You wind down the entire thing on file close? Like
- stop all contexts
- tear down all contexts
- tear down all VMs
- tear down all objects

Just winding things down in a random order and then letting gpu fault
handling sort out the mess doesn't strike me as particularly clean design
...
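
To make the ordering I mean concrete, here is a toy sketch. Everything in
it is hypothetical (struct fake_file, fake_file_close() and the step names
are mine, not from any driver), and it only records the order of the steps
rather than touching real contexts, VMs or objects:

```c
/* Toy model of an ordered wind-down on file close; all names are made up. */
#include <assert.h>
#include <stddef.h>

enum teardown_step { STOP_CTX, FREE_CTX, FREE_VM, FREE_OBJ, N_STEPS };

/* Hypothetical per-file state; it just logs which step ran when. */
struct fake_file {
	enum teardown_step log[N_STEPS];
	size_t n_logged;
};

static void record(struct fake_file *f, enum teardown_step s)
{
	if (f->n_logged < N_STEPS)
		f->log[f->n_logged++] = s;
}

/*
 * Dependents go first: contexts are stopped before they are torn down,
 * and the VMs and objects they use are only released afterwards.
 */
static void fake_file_close(struct fake_file *f)
{
	record(f, STOP_CTX);	/* nothing new reaches the hardware */
	record(f, FREE_CTX);	/* contexts no longer reference VMs */
	record(f, FREE_VM);	/* VMs no longer map objects */
	record(f, FREE_OBJ);	/* objects can go last */
}

/* Returns 1 if the steps ran exactly in the order listed above. */
static int fake_teardown_in_order(void)
{
	struct fake_file f = { .n_logged = 0 };
	size_t i;

	fake_file_close(&f);
	if (f.n_logged != N_STEPS)
		return 0;
	for (i = 0; i < N_STEPS; i++)
		if (f.log[i] != (enum teardown_step)i)
			return 0;
	return 1;
}
```

The point is only that each step runs after everything that depends on it
has gone away, so nothing gets yanked out from under a running job.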

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch