Re: [PATCH] drm/scheduler: set current_entity to next when remove from rq

From: Luben Tuikov
Date: Thu Oct 27 2022 - 11:46:49 EST


So, I started fixing this, including the bug taking the next element as an entity, but it could be actually the list_head... a la your patch being fixed, and then went down the rabbit whole of also fixing drm_sched_rq_select_entity(), but the problem is that at that point we don't know if we should start from the _next_ entity (as it is currently the case) or from the current entity (a la list_for_each_entry_from()) as it would be the case with this patch (if it were fixed for the list_head bug).

But the problem is that elsewhere in the GPU scheduler (sched_entity.c), we do want to start from rq->current_entity->next, and picking "next" in drm_sched_rq_remove_entity(), would then skip an entity, or be biased for an entity twice. This is why this function is called drm_sched_rq_remove_entity() and not drm_sched_rq_next_entity_or_null().

So all this work seemed moot, given that we've already switched to FIFO-based scheduling in drm-misc-next, and so I didn't see a point in developing this further at this point (it's been working alright)--we're going with FIFO-based scheduling.

Regards,
Luben


On 2022-10-27 05:08, Christian König wrote:
> Am 27.10.22 um 11:00 schrieb broler Liew:
>> It's very nice of you-all to finger it out that it may crash when it is the last entity in the list.   Absolutely I made a mistake about that.
>> But I still cannot understand why we need to restart the selection from the list head when the current entity is removed from rq.
>> In drm_sched_rq_select_entity, starting from head may cause the first entity to be selected more often than others, which breaks the equal rule the scheduler wants to achieve.
>> Maybe the previous one is the better choice when current_entity == entity?
>
> That's a good argument, but we want to get rid of the round robin algorithm anyway and switch over to the fifo.
>
> So this is some code which is already not used by default any more and improving it doesn't make much sense.
>
> Regards,
> Christian.
>
>>
>> Luben Tuikov <luben.tuikov@xxxxxxx> 于2022年10月27日周四 16:24写道:
>>
>> On 2022-10-27 04:19, Christian König wrote:
>> > Am 27.10.22 um 10:07 schrieb Luben Tuikov:
>> >> On 2022-10-27 03:01, Luben Tuikov wrote:
>> >>> On 2022-10-25 13:50, Luben Tuikov wrote:
>> >>>> Looking...
>> >>>>
>> >>>> Regards,
>> >>>> Luben
>> >>>>
>> >>>> On 2022-10-25 09:35, Alex Deucher wrote:
>> >>>>> + Luben
>> >>>>>
>> >>>>> On Tue, Oct 25, 2022 at 2:55 AM brolerliew <brolerliew@xxxxxxxxx> wrote:
>> >>>>>> When entity move from one rq to another, current_entity will be set to NULL
>> >>>>>> if it is the moving entity. This make entities close to rq head got
>> >>>>>> selected more frequently, especially when doing load balance between
>> >>>>>> multiple drm_gpu_scheduler.
>> >>>>>>
>> >>>>>> Make current_entity to next when removing from rq.
>> >>>>>>
>> >>>>>> Signed-off-by: brolerliew <brolerliew@xxxxxxxxx>
>> >>>>>> ---
>> >>>>>>   drivers/gpu/drm/scheduler/sched_main.c | 5 +++--
>> >>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>> >>>>>>
>> >>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>> index 2fab218d7082..00b22cc50f08 100644
>> >>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> >>>>>> @@ -168,10 +168,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>> >>>>>>          spin_lock(&rq->lock);
>> >>>>>>
>> >>>>>>          atomic_dec(rq->sched->score);
>> >>>>>> -       list_del_init(&entity->list);
>> >>>>>>
>> >>>>>>          if (rq->current_entity == entity)
>> >>>>>> -               rq->current_entity = NULL;
>> >>>>>> +               rq->current_entity = list_next_entry(entity, list);
>> >>>>>> +
>> >>>>>> +       list_del_init(&entity->list);
>> >>>>>>
>> >>>>>>          if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> >>>>>>                  drm_sched_rq_remove_fifo_locked(entity);
>> >>>>>> --
>> >>>>>> 2.34.1
>> >>>>>>
>> >>> Looks good. I'll pick it up into some other changes I've in tow, and repost
>> >>> along with my changes, as they're somewhat related.
>> >> Actually, the more I look at it, the more I think that we do want to set
>> >> rq->current_entity to NULL in that function, in order to pick the next best entity
>> >> (or scheduler for that matter), the next time around. See sched_entity.c,
>> >> and drm_sched_rq_select_entity() where we start evaluating from the _next_
>> >> entity.
>> >>
>> >> So, it is best to leave it to set it to NULL, for now.
>> >
>> > Apart from that this patch here could cause a crash when the entity is
>> > the last one in the list.
>> >
>> > In this case current current_entity would be set to an incorrect upcast
>> > of the head of the list.
>>
>> Absolutely. I saw that, but in rejecting the patch, I didn't feel the need to mention it.
>>
>> Thanks for looking into this.
>>
>> Regards,
>> Luben
>>
>