Re: [Regression] drm/scheduler: track GPU active time per entity

From: Luben Tuikov
Date: Thu Apr 06 2023 - 18:06:33 EST


On 2023-04-06 11:58, Lucas Stach wrote:
> Am Mittwoch, dem 05.04.2023 um 16:44 -0400 schrieb Luben Tuikov:
>> On 2023-04-05 13:44, Lucas Stach wrote:
>>> Hi Luben,
>>>
>>> Am Dienstag, dem 04.04.2023 um 00:31 -0400 schrieb Luben Tuikov:
>>>> On 2023-03-28 04:54, Lucas Stach wrote:
>>>>> Hi Danilo,
>>>>>
>>>>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>>>>> Hi all,
>>>>>>
>>>>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>>>>> tries to track the accumulated time that a job was active on the GPU
>>>>>> writing it to the entity through which the job was deployed to the
>>>>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>>>>> which fetches a job from the schedulers pending_list.
>>>>>>
>>>>>> Doing this can result in a race condition where the entity is already
>>>>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>>>>> once the job is fetched from the pending_list.
>>>>>>
>>>>>> After drm_sched_entity_destroy() being called it should be safe to free
>>>>>> the structure that embeds the entity. However, a job originally handed
>>>>>> over to the scheduler by this entity might still reside in the
>>>>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>>>>> already being called and the entity being freed. Hence, we can run into
>>>>>> a UAF.
>>>>>>
>>>>> Sorry about that, I clearly didn't properly consider this case.
>>>>>
>>>>>> In my case it happened that a job, as explained above, was just picked
>>>>>> from the schedulers pending_list after the entity was freed due to the
>>>>>> client application exiting. Meanwhile this freed up memory was already
>>>>>> allocated for a subsequent client applications job structure again.
>>>>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>>>>> reproduce the same corruption over and over again by just using
>>>>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>>>>
>>>>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>>>>> I miss something), which is why I'm writing this mail instead of sending
>>>>>> a fix directly.
>>>>>>
>>>>>> Spontaneously, I see three options to fix it:
>>>>>>
>>>>>> 1. Rather than embedding the entity into driver specific structures
>>>>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>>>>> reference count it, such that it's only freed up once all jobs that were
>>>>>> deployed through this entity are fetched from the schedulers pending list.
>>>>>>
>>>>> My vote is on this or something in similar vain for the long term. I
>>>>> have some hope to be able to add a GPU scheduling algorithm with a bit
>>>>> more fairness than the current one sometime in the future, which
>>>>> requires execution time tracking on the entities.
>>>>
>>>> Danilo,
>>>>
>>>> Using kref is preferable, i.e. option 1 above.
>>>>
>>>> Lucas, can you shed some light on,
>>>>
>>>> 1. In what way the current FIFO scheduling is unfair, and
>>>> 2. shed some details on this "scheduling algorithm with a bit
>>>> more fairness than the current one"?
>>>
>>> I don't have a specific implementation in mind yet. However the current
>>> FIFO algorithm can be very unfair if you have a sparse workload compete
>>> with one that generates a lot of jobs without any throttling aside from
>>> the entity queue length.
>>
>> Ah, that's interesting, let's see, a "sparse workload compete with one that
>> generates a lot of jobs", so basically we have a sparse workload compete
>> with a dense workload. So we can represent this with two entities, A, B,
>> whose jobs we're going to represent by the entities, names A and B.
>> So if we let A be the sparse workload and B the dense workload,
>> we have this, wlog,
>>
>> First/oldest job, .........................., latter/new jobs.
>> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, .....
>>
>> The current FIFO algorithm, would prefer to execute those jobs
>> in order of submission, i.e. oldest-ready-first job. Assume
>> that all jobs are ready. Then we'll execute them in order.
>> This is desirable and fair. We want to execute the jobs
>> in the order they were submitted, given also that they are
>> ready to be executed. So perhaps we want to execute them like this:
>>
>> First/oldest job, .........................., latter/new jobs.
>> Subm: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>> Time: t0,t1,t2,t3,t4,t5,t6,t7,t8,t9, ....
>> Exec: A, B, B, B, B, B, A, B, B, B, B, B, A, B, B, B, B, B, A, ...
>>
>> Any other ordering would starve either A, or B. If we executed the 2nd A
>> job at t6 or t7, then that would starve the 3rd/4th job in B, since the 2nd A job
>> arrives at the same time as that of the 3rd B job, at time t6.
>> The time t3-t0 is some delta > 0, some initial scheduler-start up time.
>>
> For simplicity now also assume that all jobs from A take 5ms of GPU
> time, while jobs from B take 50ms of GPU time.
>
>> IOW, we don't want to delay a job any more than it should wait--the oldest
>> job, which is also ready, should execute next, so that we're fair how
>> it executes in real time. We cannot boot B at t6, so that we execute A,
>> just because it is sparse, but just arrived.
>>
>> From A's point of view, it shouldn't expect its job execution time distribution
>> to be any different than its submission time distribution.
>>
>> Do you think there's a job permutation which offers a fairer scheduling
>> than the Exec line above for the Submission line above?
>>
> Yes, if we try to keep some fairness of actual GPU time made available
> to each entity by tracking the time spent over a sampling interval, we
> would see that at t6 B has already spent 100ms of GPU time, while A has
> only spent 5ms, so naturally we would prefer the newly submitted job
> from entity A when choosing the next job to schedule.

The problem is that you cannot inquire a priori about the actual
time the next task (job) would take. It might be the case
that the next A job would take a long time while the next B would take
very small amount of time.

>
>>> By tracking the actual GPU time consumed by
>>> the entities we could implement something with a bit more fairness like
>>> deficit round robin (don't pin me on the specific algorithm, as I
>>> haven't given it much thought yet).
>>
>> Since there's no preemption, this would be hard to achieve--you're at the mercy
>> of the execution time of job A_i for an entity A job i. (Assuming there's no
>> preemption as it is the current state of the GPU scheduler.)
>>
>> The only thing you can do, is punish the next job from this entity, A_i+1,
>> to execute much later. However, you don't know how long A_i+1 would take. If A_i+1
>> takes very little time, then you're better off executing it at the next opportune
>> time, i.e. when it would normally execute. But such an algorithm, which doesn't
>> know a priori the execution time of a job, would punish A_i+1 to execute much later.
>>
>> But if A_i+1 takes time as long or longer than A_i, then punishing it to execute much
>> later, would simply delay it, from an observer's point a view, it wouldn't teach
>> the context to submit smaller jobs, so that GPU sharing is more fair.
>
> Without preemption we can't really make the scheduling absolutely fair,
> but we can still do better than the simple FIFO. If a entity has taken

This is exactly my point as I stated above "Since there's no preemption..."

> up a disproportionate amount of GPU time over the last sampling
> interval, we can derate it when picking the next job to schedule to
> allow other entities to take up more GPU time in the next sampling
> interval.

The problem is that you don't know how long that new/other job would
take--it may be a long-time job as well. What if the punished context's
next job was a small quick job, but because of its previous job, it's now
been punished to wait a long time. What exacerbates the problem is if
the favourite new context picked has a big long job as the next job
because its previous job was a quick one.

The base problem with DRR, and similar scheduling algorithms when
applied to the GPU scheduling is that you don't know how long the next
job would take, a priori, and make decisions on context's past jobs. In
classical DRR applied to networking, you send bytes (the work done) until its
quanta is exhausted and then you move on to another one--which is preemption.

This approach would work if you get a hint from userspace as to the size
of jobs a context would send, i.e. knowing a priori. Then you can formulate
a context permutation which would achieve fairness, somewhat.

Are you seeing contexts sending lots of big jobs very frequently and other
contexts sending small jobs very infrequently?

What you could do, is use an adjusted-time-spent scheduling algorithm,
where you calculate the running average of a context's (past) jobs,
and schedule it with a frequency, such that the average time spent on the GPU
is about the same as that of other context's jobs.

Regards,
Luben