Re: [PATCH v2] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()

From: Luben Tuikov
Date: Wed Apr 19 2023 - 12:29:07 EST


On 2023-04-19 06:07, Lucas Stach wrote:
> On Wednesday, 19.04.2023 at 10:53 +0100, Steven Price wrote:
>> On 19/04/2023 10:44, Lucas Stach wrote:
>>> Hi Steven,
>>>
>>> On Wednesday, 19.04.2023 at 10:39 +0100, Steven Price wrote:
>>>> On 18/04/2023 11:04, Danilo Krummrich wrote:
>>>>> It has already happened a few times that patches slipped through which
>>>>> implemented access to an entity through a job that was already removed
>>>>> from the entity's queue. Since jobs and entities might have different
>>>>> lifecycles, this can potentially cause UAF bugs.
>>>>>
>>>>> In order to make it obvious that a job's entity pointer shouldn't be
>>>>> accessed after drm_sched_entity_pop_job() was called successfully, set
>>>>> the job's entity pointer to NULL once the job is removed from the entity
>>>>> queue.
>>>>>
>>>>> Moreover, debugging a potential NULL pointer dereference is way easier
>>>>> than debugging potentially corrupted memory caused by a UAF.
>>>>>
>>>>> Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
>>>>
>>>> This triggers a splat for me (with Panfrost driver), the cause of which
>>>> is the following code in drm_sched_get_cleanup_job():
>>>>
>>>> 	if (job) {
>>>> 		job->entity->elapsed_ns += ktime_to_ns(
>>>> 				ktime_sub(job->s_fence->finished.timestamp,
>>>> 					  job->s_fence->scheduled.timestamp));
>>>> 	}
>>>>
>>>> which indeed is accessing entity after the job has been returned from
>>>> drm_sched_entity_pop_job(). And obviously job->entity is a NULL pointer
>>>> with this patch.
>>>>
>>>> I'm afraid I don't fully understand the lifecycle so I'm not sure if
>>>> this is simply exposing an existing bug in drm_sched_get_cleanup_job()
>>>> or if this commit needs to be reverted.
>>>>
>>> Not sure which tree you are on. The offending commit has been reverted
>>> in 6.3-rc5.
>>
>> This is in drm-misc-next - I'm not sure which "offending commit" you are
>> referring to. I'm referring to:
>>
>> 96c7c2f4d5bd ("drm/scheduler: set entity to NULL in
>> drm_sched_entity_pop_job()")
>>
>> which was merged yesterday to drm-misc-next (and is currently the top
>> commit).
>>
>> Is there another commit which has been reverted elsewhere which is
>> conflicting?
>>
> The commit which introduced the use-after-free, which is now upgraded
> to the above NULL ptr dereference, is df622729ddbf ("drm/scheduler:
> track GPU active time per entity"); it has been reverted in
> baad10973fdb ("Revert "drm/scheduler: track GPU active time per
> entity""). The revert is upstream in 6.3-rc5.

We may actually need to track the effective average time a job
from a context has spent on the GPU, for scheduling purposes.
Just like df622729ddbf ("drm/scheduler: track GPU active time
per entity") does, but also counting the number of samples
taken, so that we can compare entities (contexts) against one
another and pick an appropriate one from which to pull the next job.
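
Roughly, and just as a sketch (the struct and helper names below are
made up for illustration and are not part of struct drm_sched_entity),
the per-entity accounting could look like this:

#include <stdint.h>

struct entity_stats {
	int64_t  elapsed_ns;	/* total GPU time of completed jobs */
	uint32_t job_count;	/* number of samples accumulated */
};

/* Called when a job completes, with its measured GPU time. */
static void entity_stats_add_sample(struct entity_stats *s, int64_t job_ns)
{
	s->elapsed_ns += job_ns;
	s->job_count++;
}

/* Average GPU time per job for this entity; 0 if no samples yet. */
static int64_t entity_stats_avg_ns(const struct entity_stats *s)
{
	return s->job_count ? s->elapsed_ns / (int64_t)s->job_count : 0;
}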

When we used Round Robin scheduling, there was a complaint that
some contexts were starved, namely ones which had submitted a
large number of jobs early on, relative to contexts which had
just one or two jobs. This is described in 977d97f18b5b8e
("drm/scheduler: Set the FIFO scheduling policy as the default").

Now there's a complaint that because some contexts submit
many long-lived jobs, possibly back-to-back, while other
contexts submit few short-lived jobs, the FIFO scheduling,
i.e. "oldest ready job executes next", ends up starving the
contexts which submit a sparse load of short-lived jobs.

There are a few scheduling options for making this fairer, i.e.
for selecting the next entity from which to take a job to run on
the GPU, and it would seem we might not be able to get away with
"oldest job waiting" or "time a job was last scheduled from
this entity", exactly for the reasons described in the previous
paragraph. Other scheduling algorithms are unavailable since we
don't know a priori how long a job will take, and the next best
thing is to approximate it from past jobs--for which we need a
measure of the average job execution time per entity.
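
As a toy illustration only (none of this is scheduler code, and
shortest-average-first is just one possible policy), picking an
entity by its approximated per-job cost could look like:

#include <stddef.h>
#include <stdint.h>

/*
 * total_ns[i] and count[i] are the accumulated GPU time and the number
 * of completed jobs of entity i. Entities with no samples yet are
 * picked outright, so new contexts aren't penalized before we know
 * anything about them.
 */
static size_t pick_next_entity(const int64_t *total_ns,
			       const uint32_t *count, size_t n)
{
	size_t i, best = 0;
	int64_t best_avg = INT64_MAX;

	for (i = 0; i < n; i++) {
		int64_t avg;

		if (!count[i])
			return i;	/* unknown cost: give it a chance */

		avg = total_ns[i] / (int64_t)count[i];
		if (avg < best_avg) {
			best_avg = avg;
			best = i;
		}
	}
	return best;
}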

In principle, the entity <--> job level might be too low a level
at which to do this decoupling, and perhaps it should be done at
a higher level instead, but we might get away with something as
simple as
	if (job->entity) {
		track time...
	}
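
Concretely, applied to the snippet Steven quoted from
drm_sched_get_cleanup_job() (which only exists with the now-reverted
df622729ddbf applied), that guard would look roughly like this:

	if (job && job->entity) {
		job->entity->elapsed_ns += ktime_to_ns(
				ktime_sub(job->s_fence->finished.timestamp,
					  job->s_fence->scheduled.timestamp));
	}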
--
Regards,
Luben

>
> Regards,
> Lucas
>
>
>> Steve
>>
>>> Regards,
>>> Lucas
>>>
>>>> Thanks,
>>>>
>>>> Steve
>>>>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
>>>>> 2 files changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 15d04a0ec623..a9c6118e534b 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>> drm_sched_rq_update_fifo(entity, next->submit_ts);
>>>>> }
>>>>>
>>>>> + /* Jobs and entities might have different lifecycles. Since we're
>>>>> + * removing the job from the entity's queue, set the job's entity pointer
>>>>> + * to NULL to prevent any future access of the entity through this job.
>>>>> + */
>>>>> + sched_job->entity = NULL;
>>>>> +
>>>>> return sched_job;
>>>>> }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 9b16480686f6..e89a3e469cd5 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -42,6 +42,10 @@
>>>>> * the hardware.
>>>>> *
>>>>> * The jobs in a entity are always scheduled in the order that they were pushed.
>>>>> + *
>>>>> + * Note that once a job was taken from the entity's queue and pushed to the
>>>>> + * hardware, i.e. the pending queue, the entity must not be referenced anymore
>>>>> + * through the job's entity pointer.
>>>>> */
>>>>>
>>>>> #include <linux/kthread.h>
>>>>
>>>
>>
>