Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

From: Madadi Vineeth Reddy
Date: Mon Apr 15 2024 - 06:37:18 EST


On 15/04/24 11:10, Ze Gao wrote:
> On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> <vineethr@xxxxxxxxxxxxx> wrote:
>>
>> Hi Ze Gao,
>>
>> On 13/03/24 14:28, Ze Gao wrote:
>>> We observered select_idle_sibling() is likely to return the *target* cpu
>>> early which is likely to be the previous cpu this task is running on even
>>> when it's actually not within the affinity list newly set, from where after
>>> we can only rely on select_fallback_rq() to choose one for us at its will
>>> (the first valid mostly for now).
>>>
>>> However, the one chosen by select_fallback_rq() is highly likely not a
>>> good enough candidate, sometimes it has to rely on load balancer to kick
>>> in to place itself to a better cpu, which adds one or more unnecessary
>>> migrations in no doubt. For example, this is what I get when I move task
>>> 3964 to cpu 23-24 where cpu 23 has a cpu bound work pinned already:
>>>
>>> swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
>>> kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
>>>
>>
>> I am able to reproduce this scenario of having an extra migration through load balance
>> swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
>> ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
>>
>> I wrote a simple c program that blocks for few seconds, meanwhile I taskset it to CPUs 33,34 while I already have a
>> busy task running on CPU 33.
>>
>>> The thing is we can actually do better if we do checks early and take more
>>> advantages of the *target* in select_idle_sibling(). That is, we continue
>>> the idle cpu selection if *target* fails the test of cpumask_test_cpu(
>>> *target*, p->cpus_ptr). By doing so, we are likely to pick a good candidate,
>>> especially when the newly allowed cpu set shares some cpu resources with
>>> *target*.
>>>
>>> And with this change, we clearly see the improvement when I move task 3964
>>> to cpu 25-26 where cpu 25 has a cpu bound work pinned already.
>>>
>>> swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
>>
>> But after applying this patch, The extra migration is still happening as CPU 33 is still chosen by try_to_wake_up.
>>
>> On placing some perf probes and testing,
>> migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
>> swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
>> swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
>> swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
>> swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
>>
>> It is observed that, select_fallback_rq is still taken in this scenario as default target is returned at the end of select_idle_sibling
>> which was CPU 4.
>
> After second thoughts, it indeed could happen if CPU 4 shares nothing
> with CPU 33(34),
> for example, in different numa nodes.
>
> IOW, it cannot benefit from select_idle_siblings() and has to rely on
> select_fallback_rq
> as the last resort. Just like what I said in the changelog, this patch
> aims to improve rq
> selection for cases where the newly allowed cpu set shares some cpu
> resources with
> the old cpu set.

Right. In power 10(where I tested), LLC is smaller being at SMT4 small core
level. So, I think there are less chances of affined CPUs to share resources
with the old CPU set.

I also ran schbench and hackbench just to make sure that there is no regression
in powerpc. Luckily there is no regression.

Tested-by: Madadi Vineeth Reddy <vineethr@xxxxxxxxxxxxx>

Thanks and Regards
Madadi Vineeth Reddy

>
> Sorry for not being able to recall all details immediately, since this
> thread has been inactive
> for a long while and receives no feedback from sched folks.
>
> Best,
> Ze
>
>> In most of my testing, default target is returned at the end of the function due to the initial checks. It's possible that there would
>> be cases where we can get optimal CPU before we reach end of the select_idle_sibling function but it would be interesting to know if the
>> extra time spent in finding an optimal cpu have an impact instead of returning it earlier if in most of the times we are returning the
>> default target at the end.
>>
>> Thanks and Regards
>> Madadi Vineeth Reddy
>>
>>>
>>> Note we do the same check for *prev* in select_idle_sibling() as well.
>>>
>>> Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
>>> ---
>>